Source: Toolbox Tech
This month, Amazon S3 became a teenager, turning 13 as the first AWS service Amazon launched. AWS now offers over 100 services, covering virtually every layer of the technology stack. What began as infrastructure offered on-demand has evolved into the standard for many companies as they develop applications that further their digital transformation initiatives.
The idea of AWS is captivating—making core building blocks of technology available with the click of a button, with just a few moments lead time, and paying for what you use. Instead of waiting weeks and months for central IT, consumers of technology can focus on building the capabilities in their applications that differentiate their offerings in their respective markets. Just as Google forever changed our expectations for simple, fast access to information, AWS has forever changed how we expect to consume infrastructure and other core technologies.
However, when it comes to data, companies are still trapped in a decades-old paradigm of dependence on IT. While software engineers can quickly provision a database to store and manage access to data for an application, data consumers must open a ticket with IT and wait months (as we once did for new servers) to gain access to the data they need to do their jobs. What Google did for public information on the internet has yet to be done for enterprise data and the data consumer.
The idea of "as-a-service" that Amazon championed for infrastructure, to the benefit of software engineers, now needs to be applied to enterprise data, to the benefit of data consumers—data scientists, analysts, and BI users, over 200 million people globally. Just as software developers can provision infrastructure and services for a new application, on-demand and with minimal lead time, data consumers need to be able to provision data for training a machine learning model or visualizing the results of an important analysis, working with their favorite tools and without relying on IT to do this work on their behalf. It should be possible to create a new dashboard in minutes rather than weeks or months.
Delivering on this vision is a massive challenge. Data is far more massive, complex, and variable than infrastructure and software services. While a Fortune 500 company may deal in thousands of instances on their favorite cloud platform, an individual analytics job can easily involve dozens of data sources and billions of data points, as well as transformations and enrichment in advance of the actual analysis. Another bottleneck in this process is the ratio of data consumers to data engineers, typically over 100:1 in most companies today. As a result, every data consumer ends up standing in line, waiting for their turn with IT, which prolongs lead times for access to data.
Data-as-a-Service is a new strategy that focuses on making the data consumer more self-sufficient and independent. Through a combination of open source technologies and best practices, companies can develop a paradigm by which data engineers are more productive in their support of data consumers, ensuring governance, security, and availability of the service, while data consumers spend the majority of their time doing what they do best—making sense of the data to help the business operate more effectively.
What are the building blocks of Data-as-a-Service? First, companies need to move away from making endless copies of data, shuffled between different technologies and environments in the name of accessibility and performance. Examples include cubes, data marts, and aggregation tables, each created to give different users faster access to some subset of enterprise data.
Instead, companies should develop a strategy where datasets are provisioned on-demand using data virtualization capabilities that provide high-performance access to data from any source, applying transformation and ensuring access controls and masking of sensitive data dynamically, at query time. While this idea has been around for many years, it has been plagued with complexity, slow performance, and no ability to provide self-service for the data consumer. Today, advances in hardware and new open source projects like Apache Arrow simplify and accelerate access to data, making this approach feasible in a way it has never been before.
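To make the idea concrete, here is a minimal, hypothetical sketch (not any particular product's API) of a "virtual dataset": a recipe that records a source reference, filters, and masking rules, and applies them only at query time, so no physical copy of the data is ever made.

```python
from typing import Callable, Dict, Iterator, List

class VirtualDataset:
    """A dataset defined as a recipe, evaluated lazily at query time."""

    def __init__(self, source: List[Dict], masked_fields: List[str]):
        self.source = source              # in a real system: a DB or S3 reference
        self.masked_fields = masked_fields
        self.filters: List[Callable[[Dict], bool]] = []

    def where(self, predicate: Callable[[Dict], bool]) -> "VirtualDataset":
        self.filters.append(predicate)    # transformations are recorded, not run
        return self

    def rows(self) -> Iterator[Dict]:
        # Filters and masking are applied only now, at query time.
        for row in self.source:
            if all(f(row) for f in self.filters):
                yield {k: ("***" if k in self.masked_fields else v)
                       for k, v in row.items()}

customers = [
    {"id": 1, "region": "EU", "ssn": "123-45-6789"},
    {"id": 2, "region": "US", "ssn": "987-65-4321"},
]

# An EU-only view with the sensitive column masked dynamically.
eu_view = VirtualDataset(customers, masked_fields=["ssn"]).where(
    lambda r: r["region"] == "EU")
print(list(eu_view.rows()))
```

The consumer gets exactly the slice they are authorized to see, while the source data stays in place; a production system would push the same logic down to the underlying engines rather than iterate in Python.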
Companies also need to think in terms of a central, vetted catalog of their data assets. Ask your analysts "where would you find data to answer a question about our customers in Europe over the past 180 days" and the answer is frequently "we would ask IT." But things are very different in their personal lives—if they were planning a vacation near the stadium of their favorite sports team, they would simply search on Google and have the answer instantly.
Companies need a similar ability for their data consumers, one that captures not only where the data is, but also whether the data is sensitive, who to contact for questions, information about the provenance and lineage of the data, and the ability to preview and jump into the data with fast, reliable access using their favorite tools.
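A catalog entry might capture the metadata listed above in a structure like the following—a hypothetical sketch, with invented field names, of what each asset's record and a simple keyword search could look like.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    name: str
    location: str            # e.g. an S3 path or a database table
    sensitive: bool          # does this asset contain sensitive data?
    owner: str               # who to contact with questions
    lineage: List[str] = field(default_factory=list)  # upstream sources

catalog = [
    CatalogEntry("eu_customers", "s3://warehouse/customers/eu",
                 sensitive=True, owner="data-team@example.com",
                 lineage=["crm.customers"]),
    CatalogEntry("web_clicks", "s3://warehouse/clickstream",
                 sensitive=False, owner="web-team@example.com"),
]

def search(entries: List[CatalogEntry], term: str) -> List[CatalogEntry]:
    """Keyword search over asset names -- the 'Google for enterprise data' idea."""
    return [e for e in entries if term.lower() in e.name.lower()]

print([e.location for e in search(catalog, "customers")])
```

Real catalogs add richer search (descriptions, tags, column names) and automated lineage capture, but the shape of the record is the same.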
Data consumers within the enterprise frequently require customized datasets that have not yet been created, such as datasets focused on a period of time, geography, or business unit. Typically this would mean these data consumers get in line for their turn with IT. With Data-as-a-Service, the data consumer can do this work themselves. They begin by searching the data catalog to find relevant datasets. The data consumer can then apply transformations and filters with a visual, guided experience, joining multiple sources together when appropriate by taking advantage of recommendations from the system. Quickly and intuitively, the data consumer can provision data for their own needs, without being dependent on IT.
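The self-service flow above—find the datasets, filter to the relevant slice, join, and aggregate—can be sketched in a few lines. The data below is invented for illustration; in practice each step would be a guided, visual action over catalog datasets rather than hand-written code.

```python
# Two datasets the consumer found in the catalog.
orders = [
    {"customer_id": 1, "amount": 120},
    {"customer_id": 2, "amount": 75},
    {"customer_id": 1, "amount": 30},
]
customers = [
    {"customer_id": 1, "region": "EU"},
    {"customer_id": 2, "region": "US"},
]

# Step 1: filter to the slice the consumer cares about (EU customers only).
eu_ids = {c["customer_id"] for c in customers if c["region"] == "EU"}

# Step 2: join orders against the filtered customers on customer_id.
eu_orders = [o for o in orders if o["customer_id"] in eu_ids]

# Step 3: a simple aggregation the consumer might chart in a dashboard.
total = sum(o["amount"] for o in eu_orders)
print(total)
```

Each step produces a customized dataset scoped by time, geography, or business unit—exactly the kind of work that otherwise lands in the IT queue.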
Companies can avoid making unnecessary copies of data for thousands of data consumers by managing their datasets as virtual objects in their Data-as-a-Service platform. In addition, this approach reduces security and governance risk, because there are fewer copies of sensitive data to track and protect.
Data-as-a-Service is a strategy that companies can implement in the cloud, on premises, or in a hybrid model. Today most companies manage their data in many different silos, including relational databases, data warehouses, data marts, NoSQL databases, and object stores like Amazon S3. With Data-as-a-Service, companies can make all their data assets available to data consumers using open source software and cloud services that help them get more value from their data, faster.