Numb3rs Prototyped Data Science

Data Science is a new term and a new job title that has been receiving quite a bit of hype. There have been arguments over the definition of this term, and whether or not it truly describes a new field of endeavor or is just an offshoot of statistics, software programming or business intelligence. Another take is that many of the definitions of a Data Scientist can be met by few if any individuals, but really define a team. The television show Numb3rs ran with new episodes in the USA from 2005 through 2010. In many ways, the show was a prototype for such a Data Science team. Let's look at the roles on the show, and see how they might translate into your organization.

The Mathematician or Statistician

Charlie Eppes is the boy genius who grew into a young professor of Mathematics at the fictional CalSci. Like so many professors, he supplemented his income by consulting; in his case, for the FBI and his brother, applying mathematics and statistics to solving crimes of all sorts. His breadth and depth of knowledge were remarkable, and unlikely to be matched in the real world. However, an applied mathematician or statistician with knowledge of a branch of mathematics or statistics relevant to your problems is essential in either a Data Scientist or a data science team.

Computational Statistician, Computer Scientist or Software Developer

Amita Ramanujan, a student at CalSci who achieves her doctorate in computational mathematics, becomes a professor at CalSci, helps with the consultations to the FBI and, as a side note, dates her one-time thesis advisor, Charlie Eppes. In many ways, Amita is the closest to being a Data Scientist of anyone in the show. Equally adept at mathematics, physics, statistics and hacking, Amita often acquires the required data from disparate sources, and transforms Charlie's mathematical visions into working code. If you hire an Amita, you might just have all you need to get Data Science producing real solutions for you.

Over time, both Charlie and Amita gain a fairly impressive domain knowledge of criminalistics.

Subject Matter Expert (SME)

And speaking of subject matter experts, this is where Numb3rs really prototyped what is required to make Data Science valuable in solving the crimes, er, problems at your organization. There were many SMEs, both as regular characters and as special roles for specific shows. The regular characters included the FBI agents, from Don Eppes, the lead agent and Charlie's brother, to his team, with David Sinclair and Colby Granger surviving the entire series. By the end of the series, these two FBI agents were taking turns suggesting and explaining mathematical approaches. Other FBI agents who were on the team for one or more seasons include Terry Lake, Megan Reeves, Liz Warner and Nikki Betancourt. Megan was also a profiler.

Also of note was the Eppes brothers' father, Alan Eppes. Alan was a retired city planner for Los Angeles. His knowledge of building, regulations, and the city's byways, processes, neighborhoods and interactions was often instrumental in understanding the results of Charlie and Amita's calculations.

Another regular was Larry Fleinhardt, holding the Walter T. Merrick chair at CalSci. It may surprise you to learn that a theoretical physicist and cosmologist, interested in studying the heavens, string theory and zero point energy, was a crucial member of a crime solving data science team, but he was, other than a brief hiatus aboard the International Space Station.

More important than these regular cast members, however, were the guest SMEs. There were some recurring roles, such as the drop-out with a knack for baseball stats, or the mechanical engineer who was more interested in how things failed than in how to build them. Specialists in flame propagation, biology, disease control, cryptology, cognition, gaming, chemistry, forensic accounting and more are all needed at one time or another to solve the crime.

Retrospective

For me, the lesson is that while you may find an individual who can be creative with the right math or statistics, find, extract and massage data, have sufficient domain expertise and write sophisticated code, turning their creative algorithms into real solutions, you're more likely to need a team. And even that team will need additional help, from within the organization or from outside consultants. Your regular team and "guest stars" will include mathematicians, frequentists, Bayesians, engineers, scientists, accountants, business analysts, and others to bring the best decisions out of your data.

And if you want to check out how realistic the mathematics was in Numb3rs, check out The Math of Numb3rs from Cornell University. For more on the cast, characters and show, look for the CBS official Numb3rs site, and, of course, the Wikipedia article on Numb3rs.

Update: 20120722: I'm honored by the mentions and retweets on Twitter from those who love the Numb3rs analogy. Thank you all.

Caggionetti

[Image: A Lone Raw Caggionetti]
[Image: Caggionetti Dusted with Turbinado Sugar and the Spice Nutmeg]
[Image: A Plateful of Fried Caggionetti dusted with turbinado sugar and nutmeg]

Caggionetti are a fried Christmas cookie from the Abruzzo region of Italy. My paternal grandmother, Leni, made them every year. Unfortunately, no one in the family ever got her recipe. They look like a fried ravioli, filled with a chestnut paste and dusted with sugar and spices. I've been making them the past few years, playing with ingredients, and I finally have a recipe that I wish to share. This makes between 50 and 60 cookies.

The dough is made with olive oil, white wine and flour. If you don't have a pasta machine to roll out thin, flat sheets of dough, won ton wrappers may be substituted.

Pastry

4 to 4 & 1/2 cups of whole wheat pastry flour
1/3 cup of extra virgin olive oil - the fruitier the better
white wine

Mound the flour up on a [marble if you have it] pastry board, make a well in the center, add the olive oil, begin kneading the oil into the flour and add the white wine until you have a very stiff dough, similar to a pasta dough. Run it through your pasta machine at least twice until it is nice and thin.

Use a ravioli cutter, round cookie cutter or a glass to cut 2 & 1/2 inch circles of dough.

Filling

My grandmother made a filling of chestnut, cocoa, raisins, figs and hazelnuts. I've seen recipes using citron, walnuts, almonds, chocolate, or cicci instead of some or all of those ingredients, and ones with no cocoa or chocolate.

12 ounces of roasted chestnuts
1/4 cup of raisins, soaked in the wine must [before it is boiled down] or in tawny port
1 pint of Grape or Wine must boiled down to about two ounces of syrup, if you can find it, or 1/2 cup of turbinado sugar and/or honey plus tawny port
1 cup of hazelnut meal
6 donatto figs done Melissese style with the tough stem removed, quartered length-wise and chopped coarsely
1/4 cup of fine quality, unsweetened cocoa
a few grinds of allspice

Mix all of these ingredients together.

Making the cookies

Using two spoons, take a chestnut sized ball of the filling, and make it egg shaped by scraping it between the spoons, then place in the center of a dough circle. Rub water around the outside edge of the dough. Pull the dough up around the filling, press together at the watered edge, and then crimp with a fork, turn it over, and crimp the other side.

Heat a cast iron pan, add about a quarter-inch of olive oil. When hot, add enough cookies to the oil to fill the pan. Turn every two minutes until the dough is golden brown [usually about 8 minutes total]. Transfer to a plate lined with paper towels. Allow to cool for a few minutes, and then dust with sugar and spice [I used nutmeg, but cinnamon, clove, allspice, cardamom, or any combination works too].

Mac N Cheese

[Image: A picture of the finished macaroni and cheese dish]

Mac'n'Cheese is a favourite dish, but the one place that I posted my recipe is gone now. Let's see if I can recreate it.

Inspired by an episode of Bones, I make my Mac'n'Cheese with leeks and pancetta now. For a vegetarian version, use your favourite vegie bacon, sprinkled with nutmeg and cinnamon while frying.

Bring 4 quarts of water to a boil, add a big handful of your favourite sea salt, and cook 1 pound of Rustichella d'Abruzzo penne for 8 minutes [two minutes less than the minimum recommended cooking time]. Drain and set aside.

Clean two medium leeks or one large leek by cutting off the roots and green part, and soaking the white part in salted cold water. Slice thinly and sweat in 3 tablespoons of sweet butter with freshly ground rainbow peppercorns until translucent. Salt to taste. Alternatively, sweat the leeks in the pancetta grease or the fat in which you sautéed the vegie bacon.

Slowly add three flat tablespoons of flour and stir for two or three minutes to make a roux.

Slowly pour in three cups of milk to make a béchamel-like sauce. Cube and then stir in one-half pound of [raw milk, if you can find it] asiago, one-half pound of fontinal [the Italian Fontinal, not the Danish Fontina] and one-half pound of monterey jack cheeses, until melted. Add one cup of heavy cream. Other variations may use a bit of mustard powder or seeds, Sierra Nevada mustard with stout, a few drops of Worcestershire sauce [remember it has anchovies], pesto, or any of a number of tapenades.

Add the cooked pasta to the cheese sauce and pour into a buttered glass lasagna or casserole dish.

Grate one-quarter pound of good quality parmigiano reggiano, and mix with one-half cup of fresh bread crumbs and the sautéed pancetta or vegie bacon. For the bread crumbs, I often make tiny cubes of whatever left-over bread I have around, soak in milk, squeeze nearly dry, and then add the cheese and savory. Sprinkle over the mac'n'cheese and dot with more sweet butter.

Bake at ~350ºF for 30 minutes or more, until the sauce is bubbling up around the edges and the topping is lightly browned.

And remember, recipes are guidelines, not rules. Experiment. Try different cheeses, sharper, milder, mixed. Add other stuff. Make the dish yours.

Comment to BBBT Blog on WhereScape

Today started for me with a great Boulder BI Brain Trust [BBBT] Session featuring WhereScape and their launch of WhereScape 3D [registration or account required to download], their new data warehouse planning tool. Beyond my interest in all things related to data management and analysis [DMA], the WhereScape 3D tool is particularly interesting to me in its potential for use in Agile environments and its flexibility in being used with other data integration tools, not just WhereScape Red. Richard Hackathorn, who has already downloaded and used it, does a great job describing WhereScape 3D, which launched in beta at the BBBT, complete with cake for those in the room. [I'm awaiting the promised cross-platform JAR to try it out on my MacBookPro.]

Unfortunately, Twitter search is letting me down today as I normally gather all the #BBBT tweets from a session, send them to Evernote, and check these "notes" as I write a blog post.

WhereScape 3D is a planning tool, allowing a data warehouse developer to profile source systems, model the data warehouse or data mart, and automagically create metadata driven documentation. Further, one can iterate through this process, creating new versions of the models and documentation, without destroying the old. The documentation can be exported as HTML and included in any web-based collaboration platform. So, there is the potential of using the documentation against Scrum style burn down lists and for lightweight Agile artifacts.

WhereScape 3D and Red come with a variety of ODBC drivers, and, with the proper Teradata licensing, the Teradata JDBC driver as well. One can also add other ODBC and JDBC drivers. However, neither WhereScape product currently allows connections to non-relational database sources. I would find this to be severely limiting: in traditional enterprises, we've never worked on a DMA project that didn't include legacy systems requiring us to pull from flat files, systems written in Pick Basic against a UniVerse or other multi-value database management system [MVDBMS], electronic data interchange [EDI] files, XML, or Java or ReSTful services. In other cases, we're facing new data science challenges of extreme volumetric flows of data from web, sensor and transaction logs, requiring real-time analytics, such as can be had with SQLstream, or stored in NoSQL data sources, such as Hadoop and its offshoots.

Which leads us to another interesting feature of WhereScape 3D: it's designed to be used with any data integration tool, not just WhereScape Red. I'm looking forward to getting that JAR file, currently hiding in a MS Windows EXE file, and trying WhereScape 3D in conjunction with Pentaho Data Integration [PDI or KETTLE] and seeing how the nimble nature of WhereScape 3D planning works with PDI Spoon AgileBI against all sorts of data flows targeting LucidDB ADBMS and data vault. Yeehah!

Full360 on BBBT

Today, Friday the 13th of May, 2011, the Boulder BI Brain Trust heard from Larry Hill [find @lkhill1 on Twitter] and Rohit Amarnath [find @ramarnat on Twitter] of Full360 [find @full360 on Twitter] about the company's elasticBI™ offering.

Serving up business intelligence in the Cloud has gone through the general hype cycles of all other software applications, from early application service providers (ASP), through the software as a service (SaaS) pitches to the current Cloud hype, including infrastructure and platform as a service (IaaS and PaaS). All the early efforts have failed. To my mind, there have been three reasons for these failures.

  1. Security concerns on the part of customers
  2. Logistics difficulties in bringing large amounts of data into the cloud
  3. Operational problems in scaling single-tenant instances of the BI stack to a large number of customers

Full360, a 15-year-old system integrator & consultancy, with a clientele ranging from startups to the top ten global financial institutions, has come up with a compelling Cloud BI story in elasticBI™, using a combination of open source and proprietary software to build a full BI stack from ETL [Talend Open Studio as available through Jaspersoft] to the data mart/warehouse [Vertica] to BI reporting, dashboards and data mining [Jaspersoft partnered with Revolution Analytics], all available through Amazon Web Services (AWS). Full360 is building upon their success as Jaspersoft's primary cloud partner, and their involvement in the RightScale Cloud Management stack, which was a 2010 winner of the SIIA CODiE award, with essentially the same stack as elasticBI™.

Full360 has an excellent price point for medium size businesses, or departments within larger organizations. Initial deployment, covering set-up, engineering time and the first month's subscription, comes to less than a proof of concept might cost for a single piece of their stack. The entry level monthly subscription, extended out for one year, is far less than an annual subscription or licensing costs for similar software, once you consider depreciation on the hardware and the cost of personnel to maintain the system. Especially considering that the monthly fee includes operations management and a small amount of consulting time, this is a great deal for medium size businesses.

The stack being offered is full-featured. Jaspersoft has, arguably, the best open source reporting tool available. Talend Open Studio is a very competitive data integration tool, with options for master data management, data quality and even an enterprise service bus for complete data integration from internal and external data sources and web services. Vertica is a very robust and high-performance column-store Analytic Database Management System (ADBMS) with "big data" capabilities that was recently purchased by HP.

All of this is wonderful, but none of it is really new, nor a differentiator from the failed BI services of the past or the on-going competition today. Where Full360 may win, however, is in how they answer the three challenges that caused the failure of those past efforts.

Security

Full360's elasticBI™ handles the security question with the answer that they're using AWS security. More importantly, they recognized the security concerns: one of their presentation sections today was titled "Hurdles for Cloud BI", those hurdles being cloud security, data security and application security. All three are handled by AWS standard security practices. Whether or not this is sufficient, especially in the eyes of customers, is uncertain.

Operations

Operations and maintenance is one area where Full360 is taking great advantage of the evolution of current Cloud services best known methods and "devops", using Opscode Chef recipes to handle deployment, maintenance, ELT and upgrades. However, whether or not this level of automation will be sufficient to counter the lack of a multi-tenant architecture remains to be seen. There are those who argue that the true Cloud, or even the older SaaS, differentiators, and the ability to scale profitably at their price points, depend on multi-tenancy, which causes all customers to be at the same version of the stack. The heart of providing multi-tenancy is in the database, and this is the point where most SaaS vendors, other than salesforce-dot-com (SFDC), fail. However, Jaspersoft does claim support for multi-tenant architecture. It may be that Full360 will be able to maintain the balance between security/privacy and scalability with their use of devops, and without creating a new multi-tenant architecture.

Also, the point of Cloud services isn't the cloud at all. That is, the fact that the hardware, software, platform, what-have-you is in a remote or distributed data center isn't the point. The point is the elastic self-provisioning: the ability of customers to add resources on their own, and to be charged accordingly.

Data Volume

The entry-level data volume for elasticBI™ is the size of a departmental data mart today. But even today, successfully loading that much data into the Cloud in a nightly ETL run simply isn't feasible. Full360 is leveraging Aspera's technology for high-speed data transfer, and AWS does support a form of good ol' fashioned "sneaker net", allowing customers to mail in hard drives. In addition, current customers with larger data volumes are drawing that data from the cloud, with the source being in AWS already, or from SFDC. This is a problem that will continue to be an "arms race" into the future, with data volumes, source location and bandwidth being in a three-way pile-up.
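
To put rough numbers on why a nightly load over the wire is hard, here is a minimal back-of-the-envelope sketch. The 500 GB volume, the link speeds and the 70% efficiency factor are all illustrative assumptions of mine, not figures from Full360, Aspera or AWS, and `transfer_hours` is a hypothetical helper name.

```python
def transfer_hours(volume_gb: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Estimate the hours needed to move volume_gb (decimal gigabytes)
    over a link rated at link_mbps megabits per second.

    efficiency is an assumed fraction of the nominal bandwidth that is
    actually sustained (protocol overhead, contention and so on).
    """
    if volume_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("volume must be >= 0, link speed > 0, efficiency in (0, 1]")
    bits = volume_gb * 8e9                             # gigabytes -> bits
    seconds = bits / (link_mbps * 1e6 * efficiency)    # bits / effective bits per second
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical 500 GB data mart over three assumed uplink speeds.
    for mbps in (10, 100, 1000):
        print(f"{mbps:>5} Mbit/s: {transfer_hours(500, mbps):7.1f} hours")
```

Under these assumptions, the slower links need far longer than a typical overnight batch window, which is the gap that high-speed transfer tools, mailed hard drives, or sources already living in the cloud are meant to close.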

In conclusion, Full360 has developed an excellent BI Service to supplement their professional services offerings. Larger organizations are still wary of allowing their data out of their control, or may be afraid of the target web services provide for hackers, as exemplified by the recent bank & retailer email scammers, er, marketing, and Sony break-ins. Smaller companies, which might find the price attractive enough to offset security concerns, haven't seen the need for BI. So, the question remains as to whether or not the market is interested in BI in the Cloud.

This post was simultaneously published on the Blog of the Boulder BI Brain Trust, of which I'm a member.

The TeleInterActive Press is a collection of blogs by Clarise Z. Doval Santos and Joseph A. di Paolantonio, covering the Internet of Things, Data Management and Analytics, and other topics for business and pleasure.