Category: "Data Mining"

BayAreaUseR October Special Event

Zhou Yu organized a great special event for the San Francisco Bay Area Use R group, and has asked me to post the slide decks for download. Here they are:

No longer missing is the very interesting presentation by Yasemin Atalay showing the difference in plotting analysis using the Windermere Humic Aqueous Model for river water environmental factors, without using R and then the increased in variety and accuracy of analysis and plotting gained by using R.

OpenSQLcamp Play with Data

On November 14 and 15th, I attended openSQLcamp 2009 in Portland, OR. It was a great event, and I was honored to be accepted for a five minute lightening talk: "I Play with Data". I would like to thank Sheeri (aka tcation) for providing Youtube videos of the lightening talks. Here's mine:

And here's a transcript, with links to things that were mentioned.

Hi mine name is Joseph, and I play with data.

It's good that I followed David [David J. Lutz, Director of Technical Sales Support, Infobright] because part of what I'm looking for, in the solution of how to do statistics with SQL, is column-store databases.

Way back in the 70's & 80's, I was doing pair programming with FORTRAN programmers [laughter in background] :D turning algorithms into software. I was, pair programming, we sat down together, I would write math, they would write software, we did things [mostly in Bayes], through the 80's [most with Wendy, who still works with me occasionally].

Then I started playing with data through other people algorithms using SQL, and relational database management systems, and then later, Business Intelligence systems, and most recently playing a lot with Pentaho, using that.

And I'm going to make a lot of statements, but I really have a question. I know of three ways that I can start doing real statistics with SQL databases. And I want to do real statistics because the most you can get just with AVERAGE, is, assuming that I have a uniform distribution or a normal distribution, and even in many cases, an average isn't necessarily the mean, and the mean is certainly not the best descriptor of the underlying distribution of the data. Right?

So, I can start doing fancier algorithms in SQL, but they're painful. And you know the big-O number, and they're nasty big-O numbers, to do, even if I have a frequency function, to try to arrive at the mean or the mode, simple things.

And if I want to do Bayesian statistics, and a Markov Chain Monte Carlo simulation to get at inferences on mathematical conjugates [snickering in the background] &#59;) … I'm not going to do this in SQL.

So, I have two other choices that I've been exploring.

Anyone here familiar with the R Project? [Several affirmative responses] Ya! Yeah! All right! I love the R Project, and I'm having a lot of fun with the R Project. The R Project is written in R and C and FORTRAN and there are thousands of packages written in FORTRAN and C and R and I'm doing a lot of nice math with it now, and that's a lot of fun. But everything in R is actually in data sets, and data sets are column-store databases, in memory. And even though you can get 8GB of memory on a lap top now, I run out of memory, frequently, with the type of stuff I do. So, what do I do? I use SQL, because relational database management systems, manage data really, really well, and R analyzes the data really, really well, and R speaks SQL through either RODBC, or DBI… Off you go.

So, I would like to use column-store databases, and one of my questions is that I'm looking for a way of speeding this up, so that I can match a column-store data set in R in memory with a column-store database such as Infobright or MonetDB or LucidDB. And do this one-to-one mapping much more efficiently than I can going through ODBC.

Does anyone have any thoughts on this?

[Discussion with someone in the audience - if you read this, please identify yourself in the comments, and thank you for talking to me] Have you heard of TL/R [my error in listening]?

I have not. I've never heard of TL/R.

It's R embedded in PostgreSQL.

OK, yes, I have. Did you say TL or PL?

PL. [PL/R by Joe Conway is back in development and becoming interesting again].

Yeah, PL/R I know. And there's a lot of things like that, but they're basically interfaces.



Yeah, which isn't all that mature. It tries to map the name of the dataframe in R, where you're doing your stuff in R, to a table in MySQL [in the weeds]. Which is really what you want, is to prodSQL, is that relationship of the sets, where basically you overloaded the dataframe… so you can access… overloaded the access operator… to go out to a SQL area, however it does it.


A third solution that I've been looking at is LucidDB, which is a column-store database with a plug-in architecture, written in Java. And there is the math commons on [oops] packages which have real statistic packages, probability distribution packages, all sorts of really neat packages, which are essentially Java libraries and I would like to see real statistics written into LucidDB as plug-ins for LucidDB [horn sounds] If anyone is interested. Thank you so much.

The notes taken during the lightening rounds were written by Ben Hengst, and can be found at openSQLcamp Lightening Talks

That last part is really the most important to me. I'm working with Nick Goodman, who recently started Dynamo Business Intelligence, and with advice from Julian Hyde and others in the Eigenbase community, to develop plugins for LucidDB which might be bundled into ADBMS versions of DynamoDB to do real statistics, making real inferences and real predictions, using the math packages from the Apache Commons, and having a transparent interface to R, so that R isn't limited by in-memory constraints.

Why not join us on ##luciddb and discuss it?

Complex Data Visualization at MySQL BI DW BoaF

Just got home from the MySQL Data Warehousing and BI Birds of a Feather gathering (BoaF). I'm tired, but my mind is on overdrive. 'Tis a great feeling.

First, I want to thank Lance Walter of Pentaho for introducing Clarise and me to the group as publishers of the OSBI Lens on Squidoo, this blog and the OSS wiki.

Clarise and I had a great conversation with Dr. Jacob Nikom of the MIT Lincoln Laboratory. The conversation ranged from the great Chimay beer that Matt Casters of Pentaho and Lead Architect of KETTLE, brought with him from Belgium, to

  • Data Modeling and Relational Algebra/Theory: the integrity of the model
  • Bayesian Statistics, Weibull Analysis and Tensor Calculus for mathematical modeling of complex systems [I love it when n-dimensional eigenvalues start floating in front of my eyes]
  • Meeting the needs of different types of users: managers, scientists, business folk
  • supplementing historical data warehouses with [near] real time data using ESB and dashboards
  • data visualization of complex data sets such that the analyses and limitations can be grasped at different levels by different users
  • collaboration among distributed workgroups of disparate career backgrounds and cultural pre-dispositions
  • use of Second Life and other virtual worlds for collaboration and data visualization
  • a calculator is to a computer [think if statment] as a flat file is to a relational database [think where clause]
  • early USSR vs. British knock-offs of IBM mainframes
  • Complexity as the balance of robustness and fragility

At various times in this discussion, we were joined by Sherman Wood, Director of BI at JasperSoft, and one of the legendary Mondrian developers, Julian Hyde of Pentaho and Mondrian Lead Architect, and Nicholas Goodman Director of BI for Pentaho.

And if you put Nick and dashboards and virtual worlds in the same post, then you have to mention Discoverer meets Duke Nukem.

Jakob, et al, thank you so very much for a great conversation.

April 2018
Mon Tue Wed Thu Fri Sat Sun
2 3 4 5 6 7 8
9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
 << <   > >>

At the beginning, The Open Source Solutions Blog was a companion to the Open Source Solutions for Business Intelligence Research Project, and book. But back in 2005, we couldn't find a publisher. As Apache Hadoop and its family of open source projects proliferated, and in many ways, took over the OSS data management and analytics world, our interests became more focused on streaming data management and analytics for IoT, the architecture for people, processes and technology required to bring value from the IoT through Sensor Analytics Ecosystems, and the maturity model organizations will need to follow to achieve SAEIoT success. OSS is very important in this world too, for DMA, API and community development.

37.652951177164 -122.490877706959


  XML Feeds


Our current thinking on sensor analytics ecosystems (SAE) bringing together critical solution spaces best addressed by Internet of Things (IoT) and advances in Data Management and Analytics (DMA) is here.

Recent Posts

powered by b2evolution free blog software