Posted by Ingo Fuchs
A significant challenge in managing large amounts of data (or Big Data) is a lack of what I like to call “total data awareness”. It’s a situation where you know (or suspect) that you have data – you just can’t find it. Many current IT environments are simply not built for total data awareness. This starts with core elements of the IT infrastructure, such as file systems. Traditional file systems and access methods were not designed to store hundreds of millions or billions of files in a single namespace. This leads to admins storing data in multiple file systems, multiple shares and complex directory structures – not because the data should be logically organized in that way, but simply because of limitations in file system architectures. This issue becomes even more pressing when data sits in multiple locations, maybe even spread across on-premises and off-premises, cloud-based storage.
Is object-based storage the answer?
Think about how you find data on your computer. Do you navigate complex directory structures, trying to remember the name of the file that hopefully has the data you are looking for – or have you moved on to search tools like Spotlight? Imagine you have hundreds of millions of files, scattered across dozens or hundreds of sites. How about just searching across these sites and immediately finding the data you are looking for? With object storage technology you can store data in objects, along with metadata that describes each object. Now you can search for your data based on metadata tags (like a filename – or even better, an account number and document type) – and manage data with policies that leverage that metadata.
However, this often means that you have to consider interfacing with your storage system through APIs, as opposed to NFS and CIFS – so your applications need to support whatever API your storage vendor offers.
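To make the metadata idea a bit more concrete, here is a minimal Python sketch of a toy, in-memory object store with a flat namespace. Everything in it (the `ObjectStore` class, the `put` and `search` methods, the tag names) is made up for illustration and is not any vendor's API; a real system would index metadata across sites rather than scan a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """An object: opaque data plus the metadata that describes it."""
    object_id: str
    data: bytes
    metadata: dict = field(default_factory=dict)


class ObjectStore:
    """Hypothetical in-memory store: one flat namespace, no directories."""

    def __init__(self):
        self._objects = {}

    def put(self, object_id, data, **metadata):
        # Store the data together with its descriptive tags.
        self._objects[object_id] = StoredObject(object_id, data, dict(metadata))

    def search(self, **criteria):
        """Return the IDs of objects whose metadata matches every criterion."""
        return sorted(
            oid
            for oid, obj in self._objects.items()
            if all(obj.metadata.get(key) == value for key, value in criteria.items())
        )


store = ObjectStore()
store.put("obj-001", b"invoice body", account="4711", doc_type="invoice", site="berlin")
store.put("obj-002", b"contract body", account="4711", doc_type="contract", site="boston")
store.put("obj-003", b"invoice body", account="9000", doc_type="invoice", site="berlin")

# Find data by what it is, not by where it happens to be stored.
print(store.search(account="4711", doc_type="invoice"))  # prints ['obj-001']
```

The point of the sketch is that the caller never needs to know a path or a share name: the account number and document type are enough to locate the object.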
CDMI to the rescue?
Today, storage vendors often use proprietary APIs. This means that application vendors would have to support a plethora of APIs from a number of different vendors, leading to a lack of commitment from application vendors to support more innovative, object-based storage architectures.
A key path to solving this issue is to leverage technology and standards developed specifically to provide a single namespace for billions of data sets, across locations and even across managed services that might reside off-premises.
On the standards side, a relatively new entry is CDMI (http://www.snia.org/cdmi), the Cloud Data Management Interface. CDMI is a standard developed by SNIA (http://www.snia.org), the Storage Networking Industry Association, with heavy involvement from a number of leading storage vendors. CDMI not only introduces a standard interface for ingesting data into and retrieving it from a large-scale repository, it also enables applications to manage the repository and where the data sits.
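For a rough feel of what talking to such an interface looks like, the Python sketch below builds (but does not send) an HTTP request that would store a data object along with its metadata. The host name, path and metadata tags are hypothetical, and the specification version in the header is an assumption; check the CDMI specification and your server's documentation for the exact values it expects.

```python
import json
import urllib.request

# Hypothetical endpoint for illustration; substitute a real CDMI server.
BASE_URL = "http://cdmi.example.com"


def build_cdmi_put(path, value, metadata):
    """Build a PUT request that creates a CDMI data object with metadata."""
    body = json.dumps(
        {"mimetype": "text/plain", "metadata": metadata, "value": value}
    ).encode("utf-8")
    return urllib.request.Request(
        url=BASE_URL + path,
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/cdmi-object",
            "Accept": "application/cdmi-object",
            # Assumed version; use the one your server actually supports.
            "X-CDMI-Specification-Version": "1.0.2",
        },
    )


req = build_cdmi_put(
    "/archive/invoice-4711.txt",
    "Invoice text goes here",
    {"account": "4711", "doc_type": "invoice"},
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would actually send it. The sketch stops short of that because there is no real server behind the example URL.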
CDMI is the new NFS
Forgive the provocation, but when it comes to creating and managing large, distributed content repositories it quickly becomes clear that NFS and CIFS are not ideally suited for this use case. This is where CDMI shines, especially with an object-based storage architecture behind it that was built to support multi-petabyte environments with billions of data sets across hundreds of sites and accommodates retention policies that can reach to “forever”.
CDMI and NFS have something in common – Ethernet
One of the key commonalities between CDMI and NFS is that both are ideally suited to deployment on an Ethernet infrastructure. CDMI, specifically, is a RESTful HTTP interface, so it runs over TCP/IP on standard Ethernet networks. Even among object storage deployments that don’t support CDMI, practically all of these multi-site, long-term repositories support HTTP (and thus Ethernet) through proprietary APIs based on REST or SOAP.
Why does this matter?
Ethernet infrastructure is a great foundation to run any number of workloads, including access to data that sits in large, multi-site content repositories that are based on object storage technologies. So if you are looking at object storage, chances are that you will be able to leverage existing Ethernet infrastructure.