Springbok by Informatica is the latest entry in the nascent self-service data preparation market. Springbok is impressive on several fronts.
Rather than data preparation, Informatica uses the term data harmonization, to emphasize the capabilities within Springbok to bridge the divide between the business and information technology. This feature truly differentiates Springbok. Other products for self-service data preparation focus 100% on the business side.
Springbok is truly a self-service tool. Though it is possible to integrate with other Informatica products, Springbok is a Cloud offering, available today, wherein you can upload your data all by yourself. For free. Try it yourself at
These are Informatica’s terms that represent two sides of the same coin: identification of the most valued data players and the most trusted data sources, allowing collaboration among business users of those data sets, and visibility to IT into the business use of data. Springbok ranks data users, as well as their Springbok recipes, data sources and data permutations to allow other users of that data to have confidence in unfamiliar data sources. Additionally, IT gains understanding of what internal and third-party business users are actually using and how they are actually using that data; all before a business user makes a request. This prevents IT from being blindsided by Shadow IT.
While Springbok is fully a Cloud product, it easily connects to both on-premises and Cloud data sources. The family traits from Informatica’s long history in data integration show up here.
Why has a self-service data preparation market come to be, fast on the heels of adoption of self-service BI tools, such as Tableau and Qlik? To solve a problem. With the advent of next generation BI tools and the trend towards self service Data Management and Analytics (DMA), business users manipulate the data themselves. They always have. The first question that we are always asked in a data warehouse or business intelligence project is “Can I export that to Excel?” As Big Data and Data Science have moved from buzzwords to business practices, it has become widely known that 80% of the Data Analytics process involves preparing the data for use by locating, cleansing and standardizing the data. Whether this is done by a Data Scientist using Unix shell tools like sed and awk, or as an iterative process between IT and business, it is time consuming. It is also boring. IT gets caught in the dilemma of handling the increasing data preparation requests and is playing catch up.
Springbok is a self service data harmonization tool that empowers business users to find the data and guides them through the process of enriching and shaping the data without the need for deep technical skills or dependence on outside help. Let’s take a closer look at the capabilities Springbok brings to all users along the gradient from non-technical to quant to IT specialist.
Project Springbok provides a quick and easy way to take data from more than one source and accurately combine it. It can automatically suggest data for data enrichment. Using any single file, any combination of sources, or all available data sources, Springbok suggests completion of spotty records or of an entire data set required for analysis through semantic analysis of the data. For example, if a column of data contains city names, but some records are blank, Springbok can use a Zip Code column within that file, or a Golden Record from a Master Data Management System, or a third-party source, such as Dun & Bradstreet, to complete the data set.
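To make the city-name example concrete, here is a minimal Python sketch of that kind of completion. It is only an illustration of the idea, not how Springbok works internally: the `ZIP_TO_CITY` table, the `fill_missing_city` function and the sample records are all hypothetical, and a plain lookup stands in for whatever semantic analysis the product actually performs.

```python
# Hypothetical reference table mapping Zip Codes to city names. In practice
# this could come from a master data source or a third-party provider.
ZIP_TO_CITY = {
    "94105": "San Francisco",
    "10001": "New York",
}


def fill_missing_city(records, zip_to_city):
    """Return copies of the records with blank 'city' values filled from 'zip'."""
    completed = []
    for record in records:
        filled = dict(record)  # copy, so the original data set is untouched
        city = (filled.get("city") or "").strip()
        if not city:
            # Suggest a value only when the Zip Code is known; otherwise stay blank.
            filled["city"] = zip_to_city.get(filled.get("zip", ""), "")
        completed.append(filled)
    return completed


# Illustrative records: one complete, one fillable, one with an unknown Zip Code.
records = [
    {"name": "Acme", "city": "San Francisco", "zip": "94105"},
    {"name": "Globex", "city": "", "zip": "10001"},
    {"name": "Initech", "city": "", "zip": "00000"},
]
completed = fill_missing_city(records, ZIP_TO_CITY)
```

The sketch leaves a record blank when no trustworthy value is available, rather than guessing, and it never overwrites a city the user already supplied.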
Springbok promotes business user collaboration by allowing business users to access correct data, to know who is the person responsible for the evolution of that data, and to understand the lineage of that data. This promotes collaboration in the enterprise by building trust through reputation. Within Springbok, a user can find other users and other data sources that their peers trust and use. This is invaluable to both new employees and to old-timers being confronted by new sources of data and changing business processes.
The basic tenet of the design philosophy of Springbok is self-service: from a user uploading a file to the Springbok Cloud and immediately being able to play with their data, to that same user being able to export to the Self-Service BI tool of their choice.
This feature is a major differentiator of Springbok. IT is able to have visibility and understand the evolution of a data set as well as the identity of key data influencers. This promotes collaboration between IT and their respective business partners. Permutation Management also aids in finding key external sources, shining a light into Shadow IT. Further, Springbok is a stand-alone product; however, for Informatica customers, data is easily centralized with one click to bring the business users’ recipes into Informatica PowerCenter for production use in the analytic environment. This last function does raise questions about configuration management, change control, and regulatory compliance. We were assured by Informatica that this is a consideration for Springbok, and again, those enterprise roots show. It will be the customer's choice how they wish to use this capability. Regulatory compliance and traceability will be handled by exposing the Springbok logs for audits. Notice the “will be”. The one-click instantiation of a Springbok recipe as a PowerCenter transformation is on the roadmap, but not available in the current version of Springbok, which is freely available.
One of the most impressive things about Springbok is the rapid adoption among Informatica customers and non-customers. One hundred users representing approximately 30 Informatica customers participated in the development of Springbok. In the three months since the announcement of the public Springbok beta program, over 1700 (now 2300 since our last briefing) users from more than 350 (now 500) organizations have been uploading data into the Springbok cloud, and happily manipulating that data. One other area where Informatica has recently delighted us is the growth of the Informatica Marketplace. We are looking forward to the day when users can contribute non-proprietary Springbok recipes to the Marketplace. In today’s connected world, data management and analytics is the competitive edge. Participation in such a wide-ranging community provides the cross-fertilization necessary to fully leverage the changes coming about through evolving technologies from social media to the Internet of Things.
The week of 2013 October 28 was a big one for Paxata, Inc. Founded in January of 2012, followed by advisories, beta customers also known as "Pax Pros", and 12 sprints, Paxata quietly released their first GA product in May of 2013. With panels and debuts at the Strata + Hadoop conference in New York and other events, leading up to announcements and demonstrations at the Constellation Connected Enterprise at the Ritz Carlton in Half Moon Bay, California, Paxata officially left stealth mode, publicly discussing:
The most wondrous feature of the Paxata Adaptive Data Preparation Platform is how it adds semantic richness to one's data sets by automatically recommending and linking to third-party and freely available data. This allows one to bring in firmographic, demographic, social and machine data within the context of the user's goals. This is what truly allows the Paxata Adaptive Data Preparation Platform to go beyond data exploration and discovery.
Paxata has received a fair amount of press as well, some of which I've referenced below. However, all this press misses what is one of the most important additions Paxata makes to the toolboxes of Data Management & Analytics [DMA] professionals… the ability to present questions to the user that they may not have thought of on their own. Paxata was one of the companies that inspired my DataGrok blog post. Paxata was in stealth at the time, and couldn't be named then. Now, I'm happy to be able to write that Paxata is one of the few companies or projects building tools that allow the creator and user of data to go beyond data discovery, beyond data exploration, to being able to fully, deeply understand their data. Data discovery and data exploration tools allow one to determine if various data sets can answer the questions posed by business, engineering or scientific challenges. These tools go further by exposing data integrity issues among data sets or data quality problems within a data set. Some such tools might help the user find new data sets or how various data sources within an organization might fit together in a data warehouse. Some hark back to grep, sed and awk to parse textual data. Others provide probabilistic and statistical tools to determine the appropriate shape, distribution or density functions of a data set. But Paxata is one tool that does all these and more, and does it through your web browser in a collaborative fashion, maintaining the history of each collaborator's operations on the data sets.
When my partner, Clarise, and I were first briefed by Paxata in November of 2012, we were so excited that we stayed over three hours. The demonstration, of what was then a much rougher product than what you see today, incited both of us to exclaim how much we wished that we had this tool back in our DMA practitioner days. We were treated to a demonstration using the data from another Constellation Research customer with which we were familiar. Over a year later, we were treated to a pre-launch briefing using current data sets from that same customer. The ease of use, the pleasantness of the user experience, the simplicity with which one could complete complex tasks, from histograms to column-splitting, showed the maturity that Paxata had gained since our first exposure. What was most important to us, was that Paxata could show a solution for every need that we would like to see in the Adaptive Data Preparation Platform, from our experiences in implementing data warehousing and business intelligence programs since 1996, as well as our decades of experience in computational statistics and operations research.
It allows data warehousing and BI extract, transform and load professionals, business analysts, data scientists, chemists, physicists, engineers, researchers, and professionals of all skills who work with data to completely understand and resonate with their data sets. The Paxata Adaptive Data Preparation Platform does what few other tools can do: it provides clues to what you didn't know to ask. It poses questions that the data can answer, but that you didn't think to ask. And it does all of this in a familiar looking interface, in HTML 5, in your favorite web browser, wherever you are, whenever you need it. In Paxata's words:
Paxata pricing is published and open. There are three subscriptions available:
Each of the Paxata subscriptions builds upon the first, from an individual subscription to the ability for those with individual subscriptions to share in a single environment, to a full organization-wide subscription. Of course, what makes this possible is that the Paxata Adaptive Data Preparation platform is available as a Cloud service, accessible through any modern HTML 5 web browser, whether that's from a sophisticated, high-end workstation, a tablet or a smart phone.
The main value comes not from a nice-looking, fairly intuitive interface, but from the underlying technologies that make Paxata so useful: powerful Mathematics, Semantics and Graph Theory algorithms. The results are easily accessible through this Cloud-based web experience, while the complexities stay under the covers, not getting in the way. This fact is what makes the Adaptive Data Preparation Platform so accessible to business analysts, and other creators and users of data who are not PhD statisticians. Paxata uses proprietary algorithms that detect relationships among data sets, using probabilistic techniques to select the best joins, semantically typing the data so that it can intelligently enrich the data, clean the data and merge the data based upon context not just metadata. All of this is done in an ad hoc fashion, with no predefined models or schæmas needed. These proprietary algorithms make use of
Distributed computing and in-memory technologies allow these computational statistics algorithms to be cost-effectively executed in parallel, across massive data sets. Coupled with the advancements in visualization technologies, Paxata is able to address a 13.5-16 Billion dollar market over the next three years, with extremely attractive pricing. The true return on investment from Paxata comes from flipping the DMA equation around. Currently, a common truism is that 80% of the time on a DMA, Data Science, DW or BI project is spent in preparing data; 20% in analyzing the data. Paxata reduces that data preparation percentage, such that 70% is analytics, 30% is preparation. This reduces not only the labor directly involved in preparing the data, but also allows an Agile framework to address significant business needs at the right time, in a sustainable fashion.
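A back-of-the-envelope Python sketch shows what flipping those percentages could mean for total effort. The key assumption here is mine, not Paxata's: the amount of analysis work stays fixed, and only the share of the project spent on preparation changes. The 20 hours of analysis and the function name `total_project_hours` are purely illustrative.

```python
def total_project_hours(analysis_hours, prep_share):
    """Total project hours when preparation takes prep_share of the whole project."""
    if not 0 <= prep_share < 1:
        raise ValueError("prep_share must be in [0, 1)")
    # Analysis fills the remaining (1 - prep_share) of the project.
    return analysis_hours / (1 - prep_share)


analysis_hours = 20  # assumed, fixed amount of analysis work

before = total_project_hours(analysis_hours, 0.80)  # the 80/20 truism
after = total_project_hours(analysis_hours, 0.30)   # the 30/70 split

prep_before = before - analysis_hours
prep_after = after - analysis_hours

print(f"Before: {before:.1f} h total, {prep_before:.1f} h preparing")
print(f"After:  {after:.1f} h total, {prep_after:.1f} h preparing")
```

Under that assumption, the same 20 hours of analysis sit inside a project of roughly 100 hours at 80% preparation, and one of under 30 hours at 30% preparation. If the time saved is instead reinvested in more analysis, the total stays the same and the benefit shows up as more questions answered.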
Paxata's strategy is to attach to the QlikView and Tableau markets that are being hampered from enterprise adoption because of these very data preparation challenges. Alongside these partnerships is the partnership with Cloudera, providing enterprise class access to modern, distributed data storage systems. Add connectors to common enterprise and external data sources and the third-party Paxata Enrichment Libraries, and it is obvious to the most casual observer that the Paxata Adaptive Data Preparation Platform addresses the most frustrating complaint of Data Scientists and Business Analysts alike: that too much of their time is spent on plumbing, whether directly or waiting for IT. We have long spoken about the need for IT to give up control of data, and realize that their most effective role is to provide a framework of success for end-users to fully, deeply understand and use their data to solve real problems. Paxata creates this framework for success.
Other Sources to learn about the Paxata launch:
Pentaho offers one of the most complete data management and analytics suites available both as an open source solution, its Community Edition, and as an Enterprise Edition:
Webdetails is a 20-person strong consultancy based in Portugal, founded by Pedro Alves, focused on building Pentaho solutions for its customers, and on data visualization. In addition to the consulting work, Webdetails has become the major committer for the open source Community Dashboard Framework project, originally developed by Ingo Klose. In the course of their work, as inspired by the muse of customer needs, Webdetails has grown the original CDF project into a full suite of OSS data visualization and dashboard projects, CTools. Over the past year, the talented Webdetails user experience team seems to have put out a new CTool almost monthly.
Pedro Alves is an extremely well-respected member of the Pentaho community, leading community events and training, appearing often in the forums and IRC, and staying connected through Twitter and Skype. Recently, Pedro was highly active in helping to create the Pentaho Marketplace, which provides direct access from the BI Server web interface for users, to a series of plug-ins for the BI Server, including CTools, and other community and third-party projects.
I have the pleasure of knowing Pedro, and several other members of the Webdetails and Pentaho teams. This week I was able to speak with Pedro, as well as Davy Nys, Vice President, EMEA & APAC at Pentaho, and Doug Moran, one of Pentaho's Founders.
Pedro doesn't feel that the acquisition will change Webdetails, in that both the UX and consulting teams will continue as before. However, both community and enterprise users of Pentaho will feel the impact of both teams, as the lessons learned from Webdetails consulting projects are implemented by the UX team, not only in the Dashboards and data visualization tools, but also, per Davy, in the overall UX throughout all the Pentaho products. Having worked with Pentaho tools as a practitioner in the past, I know that business users will appreciate this as Pentaho becomes both easier and more pleasant to use. The data scientists will also appreciate more and better tools to draw the story out of the data, and present it to the subject matter experts and business leaders in an Agile fashion.
As Pedro mentioned, most things won't change, such as the fact that CDF is the underpinning of all of Pentaho Dashboards, or the pace of development of new CTools. Several are currently underway. One that I can mention grew out of a request by the Mozilla Foundation, for a file and data browser for the Hadoop distributed file system [HDFS] that would be as easy as the file browser in any modern operating system. The result is CVB, the Community VFS Browser. One thing that will change is that more of the CTools will make their way into the main branch of the EE product as they reach the appropriate state of maturity and stability.
Pedro has many plans for CTools, and for facilitating data visualization through Pentaho. But in addition to continuing his role as the general manager of Webdetails, and Chief Architect of CTools, Pedro will also be assuming the role of Senior Vice President of Community for Pentaho. As a long time friend of the Pentaho community myself, I have to say that there couldn't be a better choice.
One of Pentaho's Founders, Doug Moran, was the "Community Guy", who stayed in this role until the start of 2011, following the original community guy, Gretchen Moran. Doug's philosophy is that any open source community needs to stand on its own to be organic and strong. The Pentaho community is one of the strongest in the OSS DMA space, and as a result, Doug felt comfortable focusing elsewhere, and assumed management of all of Pentaho's "big data" products and Instaview initiatives. As SVP of Community, Pedro will be mostly focused within the company to integrate the community internally and help drive the corporate strategy for community. He'll continue to participate in the community, but, as in the Pentaho BeeKeeper model developed by Pentaho CTO & Chief Geek James Dixon, his main concern will be to ensure that there is a rich environment for community innovation. As part of that, Pedro will also be actively pursuing ways to grow and leverage the Pentaho Marketplace. Doug also pointed out that the Pentaho community is hugely valuable for QA and as a training ground for the best Pentaho developers. This is sure to continue with Pedro in his new role. Doug and Pedro have worked together since the early days of Pentaho, when Pedro decided to quit his job, and, with his wife, create a company devoted to professional services for Pentaho projects and products. This strong relationship between the original Community Guy and the new SVP of Community can only help to make an already strong community even better and more creative.
Davy pointed out to me that there has been an increase in customer demand for Dashboards that are, in essence, apps within Pentaho. This might happen through a plan that Pedro has to make it very easy to create such dashboard-based apps without any programming ability, and then publish them to the Marketplace. This planned community plugin kick-starter [CPK] will use CDE to create the front-end, and the Pentaho Data Integration software, KETTLE and Spoon, for the backend logic. I believe that both internal and external consultants, integrating Pentaho into an organization's decision making process, will find this ability exciting, as many of these system integrators are not Java developers. The ability to push such apps to the Marketplace will also be embraced by both CE and EE users, as most customers are excited by the idea of openly sharing their solutions, and enjoy the resulting community recognition.
Webdetails fits very well into creating a finer exploratory analytics experience for the customers, and will make Pentaho a superior choice for big data. Combined with Instaview, and with the proper roadmap, it may even push Pentaho into the new Data Grok market, not only helping users answer the questions they have, but actually pointing out the questions that the data set can answer, even if the user didn't think of them.
Both CE and EE users and customers of Pentaho should welcome this acquisition, and look forward to the better UX and data visualization. Most importantly, they should plan on how they can contribute to, and benefit from the Pentaho Marketplace, as it becomes an important part of the Pentaho ecosystem.
Cara is coming to a brick-and-mortar store near you. But don't be insulted when she doesn't recognize who you are.
Recently, I met with Jason Sosa, the CEO and Founder of IMRSV, Inc… twice. What came through to me was his passion for understanding the societal and human impacts of the technologies he creates and brings to market. This passion makes their mantra of and adherence to "privacy by design" very real and central to their approach.
Cara is the core software product from IMRSV, Inc. Cara analyses your face, and determines demographic, attention and emotive statistics about you, without attempting to identify you. As IMRSV states, Cara turns any connected camera into an intelligent sensor, but does so anonymously. Move from one Cara camera to another, or move away and back again to the same Cara camera, and the temporary ID number associated with you changes.
While Cara is pre-launch, I'm excited both by the technology, and by the IMRSV, Inc business model. The business model is very simple: whether you are a small shop owner or a developer interested in using Cara as part of a sensor analytics ecosystem, you pay $39.95 per camera, which includes the stand-alone Cara software and the Cloud-based data-as-a-service. The possibilities presented by Cara are what really got me going, fueling both an exciting initial briefing and a follow-up four-hour "lunch" and demo.
What does any of this mean? Here are a few examples.
In addition to starting companies, Jason is very interested in the Singularity, and the impending impact of technology upon human employment and self-identity. This has led both to the "Privacy by Design" and "Principles of Good Use" for developers/partners. If you don't believe me, maybe you'll believe Jules Polonetsky.
"Privacy by design solutions are critical to implementing new technologies in a world where data collection has become ubiquitous. Steps that Cara takes such as not collecting any personal information, and not storing, transferring or recording any images are key to ensuring privacy concerns are addressed as these technologies are rolled out."
- Jules Polonetsky
There are various pieces of research out there right now that show that the Internet of Things will be a 15 trillion dollar market. By 2020, I strongly believe that there will be over a trillion sensors deployed and that if your "thing" isn't connected, it won't be a viable product. Companies like IMRSV, Inc are providing the ecosystem to allow sensor analytics from everyday objects at very affordable prices. This will push this market even further and faster than the pundits anticipate. So, let me put on my tinfoil hat and stand on my soap box:
Big Data is a catchy phrase. Unfortunately, it is often misused and misunderstood. Often, Hadoop and Big Data are used interchangeably, as if the Apache Hadoop family of projects were the only solution for Big Data, or as if the only use for these projects were Big Data. Neither is true.
As an EDW/BI practitioner, I watched Hadoop, or really the Map/Reduce framework, be embraced and forced into being by software developers who were frustrated by Structured Query Language (SQL) and the need to create Entity-Relationship Diagrams (ERD) as data models or schæmas. They were equally unhappy with the various work-arounds to access Relational Database Management Systems from within their programs, such as Object-Relational Mappers (ORMs) and Data Access Objects (DAOs). At first, I felt that these developers were simply lazy.
However, as I worked more with these so-called NoSQL technologies, they helped to clarify the dissatisfaction that I felt during the years I was leading EDW and BI projects. Thirty years ago, I worked in Aerospace System Engineering, developing methods and algorithms for risk assessment using Bayesian statistics. But, by 1996, I became involved in my first EDW project. Since then, the actual structure and functions associated with the data, as defined by the data, became less important than fitting the data into an artificial structure imposed by business process models.
Don't get me wrong. Relational algebra, relational calculus and the DBMS technologies that came out of this mathematics, are all very useful. And, in the right hands, SQL is a very powerful language. ERDs provide a wonderful way to map data to business processes and to both transactional and analytic systems.
But… There is so much more that can be done with the data coming not only from traditional human-to-machine (H2M) interactions, but increasingly from human-to-human (H2H), machine-to-machine (M2M) and machine-to-human (M2H) exchanges. The interweaving of the flows of data from such disparate sources is what drives my research today.
These, and over 70 other use cases that I'm cataloguing, come from the innovation surrounding the hype of Big Data, and the Data Science movement. In a recent Quark, I've classified this innovation into 11 areas. A complete mindmap is linked from the initial mindmap shown below, and in the report.
The Quark covers the trends coming from these innovations, and develops the four keys required to bring valuable decision making processes into your organization from these innovations. It's entitled "Big Data: It's Not the Size, It's How You Use It". For such a simple report, it took over 8 months to develop. Mostly this delay was caused by the fast-paced evolution of the innovations. The executive summary from the Quark is linked from the title.
I hope that you find that information, as well as the mindmap, useful in incorporating inference, prediction, insight and performance with intuition for making better decisions.