Springbok by Informatica is the latest entry in the nascent self-service data preparation market. Springbok is impressive on several fronts.
Rather than data preparation, Informatica uses the term data harmonization to emphasize Springbok's capabilities to bridge the divide between business and information technology. This emphasis truly differentiates Springbok: other self-service data preparation products focus entirely on the business side.
Springbok is truly a self-service tool. Though it is possible to integrate with other Informatica products, Springbok is a Cloud offering, available today, wherein you can upload your data all by yourself. For free. Try it yourself at
These are Informatica's terms that represent two sides of the same coin: identification of the most valued data players and the most trusted data sources, allowing collaboration among business users of those data sets, and giving IT visibility into the business use of data. Springbok ranks data users, as well as their Springbok recipes, data sources and data permutations, to allow other users of that data to have confidence in unfamiliar data sources. Additionally, IT gains an understanding of what data internal and third-party business users are actually using, and how they are using it, all before a business user makes a request. This prevents IT from being blind-sided by Shadow IT.
While Springbok is fully a Cloud product, it easily connects to both on-premises and Cloud data sources. The family traits from Informatica’s long history in data integration show up here.
Why has a self-service data preparation market come to be, fast on the heels of the adoption of self-service BI tools such as Tableau and Qlik? To solve a problem. With the advent of next-generation BI tools and the trend towards self-service Data Management and Analytics (DMA), business users manipulate the data themselves. They always have. The first question that we are always asked in a data warehouse or business intelligence project is “Can I export that to Excel?" As Big Data and Data Science have moved from buzzwords to business practices, it has become widely known that 80% of the data analytics process involves preparing the data for use: locating, cleansing and standardizing it. Whether this is done by a Data Scientist using Unix shell tools like sed and awk, or as an iterative process between IT and business, it is time consuming. It is also boring. IT gets caught in the dilemma of handling ever-increasing data preparation requests while perpetually playing catch-up.
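To make that 80% concrete, here is a hypothetical sketch of the kind of hand-rolled cleansing and standardization work described above. The records, column names, and cleanup rules are all invented for illustration; real projects face this across thousands of columns and millions of rows.

```python
# A toy taste of the "boring 80%": locating, cleansing and standardizing
# records by hand before any analysis can begin.
import re

raw_rows = [
    {"customer": "  Acme Corp ", "state": "california", "revenue": "$1,200.50"},
    {"customer": "Acme Corp",    "state": "CA",         "revenue": "980"},
    {"customer": "Globex",       "state": "tx",         "revenue": "$2,000"},
]

# Hand-maintained mapping tables like this are typical of manual data prep.
STATE_CODES = {"california": "CA", "ca": "CA", "tx": "TX", "texas": "TX"}

def clean(row):
    """Standardize one record: trim names, map states, parse currency."""
    return {
        "customer": row["customer"].strip(),
        "state": STATE_CODES.get(row["state"].strip().lower(), row["state"]),
        "revenue": float(re.sub(r"[$,]", "", row["revenue"])),
    }

cleaned = [clean(r) for r in raw_rows]
print(cleaned[0])  # {'customer': 'Acme Corp', 'state': 'CA', 'revenue': 1200.5}
```

Every new source means new mapping tables and new parsing rules, which is exactly the labor that self-service data preparation tools aim to take off the table.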
Springbok is a self-service data harmonization tool that empowers business users to find data, and guides them through the process of enriching and shaping that data without the need for deep technical skills or dependence on outside help. Let's take a closer look at the capabilities Springbok brings to all users along the gradient from non-technical, to quant, to IT specialist.
Springbok provides a quick and easy way to take data from multiple sources and accurately combine them. It can automatically suggest data for enrichment. Using any single file, any combination of sources, or all available data sources, Springbok suggests completion of spotty records, or of an entire data set required for analysis, through semantic analysis of the data. For example, if a column of data contains city names, but some records are blank, Springbok can use a Zip Code column within that file, or a Golden Record from a Master Data Management system, or a third-party source such as Dun & Bradstreet, to complete the data set.
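A minimal sketch of that city-from-Zip-Code example is below. The lookup table, column names and records are invented for illustration; Springbok's actual semantic analysis, which recognizes what a column means and proposes the enrichment automatically, is of course far richer than a hard-coded dictionary.

```python
# Hypothetical illustration: complete blank city values from a Zip Code
# column using a reference lookup, the way an enrichment suggestion would.
ZIP_TO_CITY = {"94105": "San Francisco", "10001": "New York"}  # toy reference data

records = [
    {"zip": "94105", "city": "San Francisco"},
    {"zip": "10001", "city": ""},   # spotty record: city is blank
    {"zip": "94105", "city": ""},   # another incomplete record
]

def enrich(record):
    """Fill a missing city from the zip-code reference, if one exists."""
    if not record["city"]:
        record = {**record, "city": ZIP_TO_CITY.get(record["zip"], "")}
    return record

enriched = [enrich(r) for r in records]
print(enriched[1]["city"])  # New York
```

In Springbok the reference source could just as easily be a Golden Record from an MDM system or a third-party provider such as Dun & Bradstreet, rather than a local table.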
Springbok promotes business-user collaboration by allowing business users to access correct data, to know who is responsible for the evolution of that data, and to understand the lineage of that data. This promotes collaboration in the enterprise through trust built on reputation. Within Springbok, a user can find other users, and other data sources, that their peers trust and use. This is invaluable both to new employees and to old-timers confronted by new sources of data and changing business processes.
The basic tenet of Springbok's design philosophy is self-service: from a user uploading a file to the Springbok Cloud and immediately being able to play with their data, to that same user being able to export to the self-service BI tool of their choice.
This feature is a major differentiator of Springbok. IT gains visibility into the evolution of a data set, as well as the identity of key data influencers. This promotes collaboration between IT and their respective business partners. Permutation Management also aids in finding key external sources, shining a light into Shadow IT. Further, Springbok is a stand-alone product; however, for Informatica customers, data is easily centralized, with one click bringing the business users' recipes into Informatica PowerCenter for production use in the analytic environment. This last function does raise questions about configuration management, change control, and regulatory compliance. We were assured by Informatica that this is a consideration for Springbok, and again, those enterprise roots show. How this capability is used will be the customer's choice. Regulatory compliance and traceability will be handled by exposing the Springbok logs for audits. Notice the “will be”. The one-click instantiation of a Springbok recipe as a PowerCenter transformation is on the roadmap, but not available in the current version of Springbok, which is freely available.
One of the most impressive things about Springbok is its rapid adoption among Informatica customers and non-customers alike. One hundred users, representing approximately 30 Informatica customers, participated in the development of Springbok. In the three months since the announcement of the public Springbok beta program, over 1700 (now 2300 since our last briefing) users from more than 350 (now: 500) organizations have been uploading data into the Springbok cloud and happily manipulating that data. One other area where Informatica has recently delighted us is the growth of the Informatica Marketplace. We are looking forward to the day when users can contribute non-proprietary Springbok recipes to the Marketplace. In today's connected world, data management and analytics is the competitive edge. Participation in such a wide-ranging community provides the cross-fertilization necessary to fully leverage the changes coming about through evolving technologies, from social media to the Internet of Things.
Vibe Data Stream [Vibe] and Virtual Data Machine [VDM] combine at the center of Informatica's Internet of Things strategy. They are primarily for Machine-to-Machine [M2M] data and, by connecting through PowerCenter, ultimately lead to Machine-to-Human [M2H] data. The goal is to have VDMs residing in mobile devices, sensor packages, or as part of sensor networks. At this point, VDMs require more processing power than is available in most components. Thus, Vibe and VDM are primarily suited today to data centers, network operations centers, and communication centers.
However, Informatica is seeing a broad range of use cases involving both large machines and sensor networks, from many different sectors including
One Proof of Concept [PoC] currently underway is with a Heating, Ventilation and Air Conditioning [HVAC] company. In the PoC, the HVAC company is looking at streaming data from all of their installations. Using Informatica products, they are bringing this data into their data center for both streaming and batch analytics. There are actually three use cases being examined in this PoC:
Other field trials look at Vibe and VDM capabilities in regard to Pub/Sub models working with Informatica Ultra Messaging, as well as persisting data in all forms of data stores from traditional Enterprise Data Warehouses [EDW] to Hadoop [HDFS] and NoSQL databases such as Cassandra. These field trials involve solving the ongoing problems of the different areas mentioned above.
Perhaps the most involved trials being done to date with Informatica Vibe and VDM are within the Telecommunications space. As one might expect, the explosion of data and customer expectations, as cellular goes from 2G to 3G to 4G/LTE, requires real-time management of ever-increasing amounts of data. But wireline/fiber and cable use cases are exploding as well, as the traditional marketplaces of voice, entertainment and connectivity intertwine.
Informatica is aggressively working with partners, such as chip, sensor and package manufacturers, to understand how to optimally implement Vibe, whether through Vibe's streaming collection capability on the device itself or as part of the larger infrastructure at some point in the collection tier. Currently, collecting sensor data can hit performance limits imposed by the sensor or communication base protocols. Thus, for example, in the oil and gas industry, Informatica is working with both vertical-specific sensor manufacturers and large organizations in the industry to determine how Vibe can supplement or even replace the collection tier.
What Informatica brings to evolving sensor analytics ecosystems [SAE] is not only their specific technologies of Vibe and VDM, but the combination of these with a complete package supporting streaming analytics, operational intelligence, complex event processing [CEP], batch analytics, predictive analytics, reporting, data marts and EDW, through their existing technology families such as Ultra Messaging, PowerCenter, Master Data Management, Data Quality, and more, in both traditional and Cloud deployments. This results in bringing mature market features to the SAE in the form of
This blog post is based upon both the Informatica Press release referenced below, and a private briefing from the Informatica team that allowed us to gather more information and get answers to our questions. Also referenced are other of our blog posts on IoT and Big Data, for context.
Back in July, I wrote
[An] excellent example of the importance of the Industrial Internet comes from Salesforce.com's use of The Social Machine by Digi International and its Etherios business unit, bringing sensor data into customer relationship management [CRM] by allowing sensors embedded in industrial refrigerators, hot tubs, and heavy and light equipment of all types to open SFDC Chatter sessions and to file cases.
At Dreamforce 2013, Salesforce.com is announcing Salesforce1, their new Internet of Customers ecosystem, bringing together Force.com, Heroku, and ExactTarget FUEL platforms under a united series of APIs controlled by the Salesforce1 App.
Today and tomorrow, Dreamforce is all about the Internet of Things, and I'll be providing my analyses of how SFDC is building out its massive existing ecosystem of partners, services and customers into Marc Benioff's evolving vision of the Internet of Customers. The message here is that Salesforce1 is ready today to prepare their customers to leverage the opportunities presented by the Internet of Things. As Cisco states, over a trillion dollars in added value was left on the table this year by companies not taking advantage of IoT. For 2014, SFDC's customers won't have an excuse to leave this money behind.
One challenge for Salesforce1 is its dependence on partners for analytics. Are SFDC partners ready to help bring the Internet of Customers to full potential through connected analytics? How will IBM's MQTT, Smarter Planet, and Cognitive Computing, Oracle's Device-to-Data-Center, Teradata's Hub for Monetizing the IoT, Infobright's M2M-optimized ADBMS, and many other data management & analytics initiatives focused on M2M and M2H data fit in?
Will Salesforce1 create, or be integrated into, Sensor Analytics Ecosystems, with the necessary marketplaces for raw data, processed data, and insights from M2M & M2H data? SFDC has never been up to the challenge of analytics in the past. While there are many general BI and Analytics partners, SFDC-specific analytics firms have come and gone. Salesforce1 is a broader concept and brings SFDC into a future beyond salesforce automation and customer relationship management.
The IoT Keynote at Dreamforce today, and the packed sessions on IoT will answer some of these questions. I'll be providing my analysis of how well these questions are answered in an Event Report blog post after the close of Dreamforce 2013.
The week of 2013 October 28 was a big one for Paxata, Inc. Founded in January of 2012, and following advisories, beta customers (also known as "Pax Pros"), and 12 sprints, Paxata quietly released their first GA product in May of 2013. With panels and debuts at the Strata + Hadoop conference in New York and other events, leading up to announcements and demonstrations at the Constellation Connected Enterprise at the Ritz-Carlton in Half Moon Bay, California, Paxata officially left stealth mode, publicly discussing:
The most wondrous feature of the Paxata Adaptive Data Preparation Platform is how it adds semantic richness to one's data sets by automatically recommending and linking to third-party and freely available data. This allows one to bring in firmographic, demographic, social and machine data within the context of the user's goals. This is what truly allows the Paxata Adaptive Data Preparation Platform to go beyond data exploration and discovery.
Paxata has received a fair amount of press as well, some of which I've referenced below. However, all this press misses what is one of the most important additions Paxata makes to the toolboxes of Data Management & Analytics [DMA] professionals… the ability to present questions to the user that they may not have thought of on their own. Paxata was one of the companies that inspired my DataGrok blog post. Paxata was in stealth at the time, and couldn't be named then. Now, I'm happy to be able to write that Paxata is one of the few companies or projects building tools that allow the creator and user of data to go beyond data discovery, beyond data exploration, to being able to fully, deeply understand their data. Data discovery and data exploration tools allow one to determine if various data sets can answer the questions posed by business, engineering or scientific challenges. These tools go further by exposing data integrity issues among data sets or data quality problems within a data set. Some such tools might help the user find new data sets or how various data sources within an organization might fit together in a data warehouse. Some hark back to grep, sed and awk to parse textual data. Others provide probabilistic and statistical tools to determine the appropriate shape, distribution or density functions of a data set. But Paxata is one tool that does all these and more, and does it through your web browser in a collaborative fashion, maintaining the history of each collaborator's operations on the data sets.
When my partner, Clarise, and I were first briefed by Paxata in November of 2012, we were so excited that we stayed over three hours. The demonstration, of what was then a much rougher product than what you see today, incited both of us to exclaim how much we wished that we had this tool back in our DMA practitioner days. We were treated to a demonstration using the data from another Constellation Research customer with which we were familiar. Over a year later, we were treated to a pre-launch briefing using current data sets from that same customer. The ease of use, the pleasantness of the user experience, the simplicity with which one could complete complex tasks, from histograms to column-splitting, showed the maturity that Paxata had gained since our first exposure. What was most important to us, was that Paxata could show a solution for every need that we would like to see in the Adaptive Data Preparation Platform, from our experiences in implementing data warehousing and business intelligence programs since 1996, as well as our decades of experience in computational statistics and operations research.
It allows data warehousing and BI extract, transform and load professionals, business analysts, data scientists, chemists, physicists, engineers, researchers, and professionals of all skills who work with data to completely understand and resonate with their data sets. The Paxata Adaptive Data Preparation Platform does what few other tools can do, it provides clues to what you didn't know to ask. It poses questions that the data can answer, but that you didn't think to ask. And it does all of this in a familiar looking interface, in HTML 5, in your favorite web browser, wherever you are, whenever you need it. In Paxata's words:
Paxata pricing is published and open. There are three subscriptions available:
Each Paxata subscription builds upon the previous one, from an individual subscription, to the ability for those with individual subscriptions to share in a single environment, to a full organization-wide subscription. Of course, what makes this possible is that the Paxata Adaptive Data Preparation Platform is available as a Cloud service, accessible through any modern HTML 5 web browser, whether from a sophisticated, high-end workstation, a tablet or a smart phone.
The main value comes not from a nice-looking, fairly intuitive interface, but from the underlying technologies that make Paxata so useful: powerful Mathematics, Semantics and Graph Theory algorithms, the results of which are easily accessible through this Cloud-based web experience, while the complexities remain under the covers, not getting in the way. This is what makes the Adaptive Data Preparation Platform so accessible to business analysts and other creators and users of data who are not PhD statisticians. Paxata uses proprietary algorithms that detect relationships among data sets, using probabilistic techniques to select the best joins, and semantically typing the data so that it can intelligently enrich, clean and merge the data based upon context, not just metadata. All of this is done in an ad hoc fashion, with no predefined models or schemas needed. These proprietary algorithms make use of
Distributed computing and in-memory technologies allow these computational statistics algorithms to be cost-effectively executed in parallel, across massive data sets. Coupled with the advancements in visualization technologies, Paxata is able to address a 13.5-16 billion dollar market over the next three years, with extremely attractive pricing. The true return on investment from Paxata comes from flipping the DMA equation around. Currently, a common truism is that 80% of the time on a DMA, Data Science, DW or BI project is spent preparing data, and 20% analyzing it. Paxata reduces that data preparation percentage, such that 70% of the time is analytics and 30% preparation. This reduces not only the labor directly involved in preparing the data, but also allows an Agile framework to address significant business needs at the right time, in a sustainable fashion.
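The probabilistic join selection described above can be caricatured in a few lines: score every pair of columns across two data sets by value overlap and propose the highest-scoring pair as the join key. This toy example, with invented data sets and a simple Jaccard score, only hints at the idea; Paxata's proprietary algorithms are far more sophisticated.

```python
# Toy caricature of probabilistic join detection: find the pair of columns
# from two data sets whose values overlap the most (Jaccard similarity).
def jaccard(a, b):
    """Overlap score between two value collections, in [0, 1]."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0

def best_join(left, right):
    """Return the (left_col, right_col, score) with the highest overlap."""
    candidates = [
        (lc, rc, jaccard(lv, rv))
        for lc, lv in left.items()
        for rc, rv in right.items()
    ]
    return max(candidates, key=lambda c: c[2])

# Two hypothetical data sets, stored column-wise for simplicity.
customers = {"cust_id": ["C1", "C2", "C3"], "region": ["West", "East", "West"]}
orders    = {"customer": ["C1", "C3", "C3"], "amount": [10, 20, 30]}

col_l, col_r, score = best_join(customers, orders)
print(col_l, col_r, round(score, 2))  # cust_id customer 0.67
```

A real implementation would also weigh semantic types (so a zip code never joins to a revenue figure), value distributions, and sampling for scale, which is where the mathematics under Paxata's covers earns its keep.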
Paxata's strategy is to attach to the QlikView and Tableau markets, which are being hampered from enterprise adoption by these very data preparation challenges. Along with these partnerships is a partnership with Cloudera, providing enterprise-class access to modern, distributed data storage systems. Add connectors to common enterprise and external data sources, and the third-party Paxata Enrichment Libraries, and it is obvious to the most casual observer that the Paxata Adaptive Data Preparation Platform addresses the most frustrating complaint of Data Scientists and Business Analysts alike: that too much of their time is spent on plumbing, whether directly or waiting for IT. We have long spoken about the need for IT to give up control of data, and to realize that their most effective role is to provide a framework of success for end-users to fully, deeply understand and use their data to solve real problems. Paxata creates this framework for success.
Other Sources to learn about the Paxata launch:
The number of articles about the Internet of Things [IoT], Machine-to-Machine communication [M2M], the Industrial Internet, the Internet of Everything [IoE] and the like has been increasing since I wrote my post introducing my IoT mindmap almost a year ago. I learn from some of them, at some I nod sagely in agreement, and others cause me to scratch my head in confusion. One in particular this last week fell into that last category, when it claimed that the terms listed here all mean the same thing.
From my reading, briefings and research over the past year, I've come to a different conclusion. The following definitions are my opinion; I can't say that any authority has certified them. I believe them to be accurate, and if any vendor with an interest in any of these definitions strongly agrees or disagrees, I would be very much interested in talking with you.
The first thing to be considered is Machine-to-Machine communication. M2M is really just one of four types of interchanges that occur over the Internet, intranets and any command, control, communication, computing or intelligence network. The other types are Human-to-Machine [H2M], Human-to-Human [H2H] and Machine-to-Human [M2H]. H2M and H2H interchanges have been around since the beginning of ARPAnet, which evolved to become the Internet. From the many different protocols at the beginning, such as FTP and Gopher [among many more], two have come to dominate Internet traffic:
Every transaction made using a computer, whether online transaction processing [OLTP], electronic data interchange [EDI], or eCommerce, and every purchase you make at your favorite web store, is an example of H2M.
Of course, starting with email [still the dominant form of communication over the Internet for businesses and individuals] and expanding to Twitter, Facebook, Waze, Yelp, Foursquare, Yammer, all the various instant messaging networks, voice over Internet protocol [VoIP] and your favorite public or private social network, we have many examples of Internet-enabled H2H communication.
These two, H2M and H2H, have become so prevalent, and so important to business, governments and our personal lives, that the over-hyped phenomenon "Big Data" was born. But the importance and pervasiveness of M2M, and soon M2H, data will swamp the so-called data tsunami of the past decade. Predictive maintenance, building automation, elastic provisioning, machine logs, software "phoning home" and automated decision support systems are all good examples of direct M2M interchanges, where one sensor, device, embedded computer or system has a productive exchange with another such machine, without concurrent human intervention. Self-quantification, gamification, personalized medicine and augmented reality [AR] are all early examples of M2H interchanges, where sensors, devices, embedded computers or systems directly provide relevant information to an individual, allowing for better-informed decisions.
The term Internet of Things was coined in 1999 by Kevin Ashton. Since then, it has come to mean any device that is connected to the Internet. Most people don't consider computers, routers, edge equipment and other Internet infrastructure hardware to be a "device", and usually exclude such hardware from consideration as a thing that uses that infrastructure. For many, the devices are only smart phones, feature phones and tablets. This has led Cisco and GSMA to predict that there will be 30 to 50 billion devices connected to the Internet by 2020. However, even these organizations, and most people with whom I speak who have skin in the IoT game, feel that my own prediction of one trillion devices connected to the Internet by 2020 is more likely. These devices span from individual, but connected, sensors to heavy machinery. However, as companies come out with Tweeting diapers, glowing clothing and other such silliness, the Internet of Things is in danger of becoming a fad. So, what is the Internet of Things? To my mind, the Internet of Things comprises any sensor, embedded sensor, embedded computer, component, package, sub-system or system that is connected to the Internet and intended to have meaningful interchanges with other such items and with humans. The Internet of Things primarily uses M2M, and increasingly M2H, interchanges.
The first treatment of the IoT as a large, complex system to which I was exposed was at a networking event in 2008, one of those events where IBM was introducing their new initiative for a Smarter Planet. The Smarter Planet brings complex systems such as the Smart Grid, building automation across facilities, water management, traffic management, Smarter Cities and Smarter Farms under one System: one approach, and one initiative, that raises the IoT to a new level of importance for world governments, global businesses and individuals, from the poorest village to the most cosmopolitan city. The Smarter Planet initiatives go beyond IoT, beyond the individual things, to treating all such things, the Internet, the protocols, processes and policies as one very large, complex, possibly cognitive system.
The Industrial Internet is a term coined by General Electric [GE] in 2011. At a very simple level, the Industrial Internet can be thought of as connected industrial control systems. But the impact is much more complex, and much more significant. The first thing to realize is that connected sensors and computing power will be embedded in everything: from robots and conveyor belts on the factory floor, to tractors and irrigation on the farm; from heavy equipment to hand drills; from jet engines to bus fleets; every piece of equipment, everywhere. The Industrial Internet also primarily uses M2M and M2H. While this sounds much like the Internet of Things, the purpose is much different. The Industrial Internet is about changing business processes and making data the new coin of the realm. GE is very serious about the Industrial Internet, and while they don't use the term Sensor Analytics Ecosystems yet, Data Marketplaces are rapidly becoming core to GE's businesses, as proven by their recent 140 million dollar investment in Pivotal, the new Big Data Platform as a Service [PaaS] by EMC. Another excellent example of the importance of the Industrial Internet comes from Salesforce.com's use of The Social Machine by Digi International and its Etherios business unit, bringing sensor data into customer relationship management [CRM] by allowing sensors embedded in industrial refrigerators, hot tubs, and heavy and light equipment of all types to open SFDC Chatter sessions and to file cases.
Cisco has recently started two initiatives related to the IoT: the Internet of Everything [IoE] and Fog Computing. IoE seeks to bring together H2H, H2M, M2M and M2H interchanges. On June 19th of this year, Cisco introduced their IoE Value Index [link to PDF]. By bringing together people, processes, data, and things, and with some impressive research to back it up, Cisco feels that the IoE could bring 1.2 trillion dollars in added value in 2013, and 14.4 trillion dollars in added market value by 2022, to businesses around the world. Fog Computing tends more to the infrastructure of the IoE, bringing the concepts of Cloud Computing, such as distributed computing and elastic provisioning, to the edge of the network, with an emphasis on wireless connectivity, streaming data, and heterogeneity.
While some of the above are corporate initiatives, they each represent important and distinct concepts. In addition to these from IBM, Cisco, GE, EMC and Salesforce.com, there are other initiatives and products in this sphere coming from HP, Oracle, SAP, MuleSoft, SnapLogic, Nuance, Splunk, Mocana, Evrythng, Electric Imp, Quirky, reelyActive, Ayla, SmartThings, Withings, Fitbit, Jawbone (including BodyMedia), Nike, Basis, Cohda Wireless, AT&T, Verizon, Huawei, Orange, Belkin, DropCam, Gravity Jack, Alcatel-Lucent, and Siemens. Platforms, software, sensor packages and services are being developed by a wide variety of innovative companies:
These innovative companies, and others, are implementing one or more of these concepts in a variety of ways. As I stated at the beginning, I don't think that these concepts are the same. While the IoT was first named 14 years ago, it is still early days in its implementation. There are many ways that the Internet of Things might evolve, and many missteps that could lead the IoT to be a passing fancy, leaving some important changes in its wake, but never reaching its full potential. I think there is one way, and one way only, that all of the concepts and initiatives will come together and change everything that we do, how we make decisions, how we think about ourselves, how governments make policy, how businesses make money: The Sensor Analytics Ecosystem [SAE]. Here's a tease of a mindmap giving a hint of what I mean by the SAE. Look for my upcoming report "Sensor Analytics as an Ecosystem" and a series of research reports delving into each area introduced therein. The companies listed above are building out parts of the SAE, and will feature heavily in these reports.