
ARTIFICIAL INTELLIGENCE DATA STORAGE


Author: Yeshwanth Buggaveeti

We describe existing services for automating data storage monitoring and management. Scientists want access to their data as quickly as possible, and some experiments require storing large amounts of data that must be kept safe: the data cannot be lost or corrupted, and storing it should be as inexpensive as possible. This article presents a knowledge-based system that can manage data, i.e., make decisions about transferring, replicating, or deleting it, and reviews some of the most popular solutions on the market. The focus is on data storage management using AI approaches such as fuzzy logic and rule-based expert systems.
 

Introduction:

The issue of data storage and exchange is becoming increasingly serious and pressing. In scientific computing it is frequently necessary to perform computations on enormous amounts of data, and many experiments produce massive volumes of results that must be kept. Furthermore, when defining the quality of a computation, it is often essential to specify how quickly the data can be accessed. Beyond scientific computing, the growth of data and the challenges connected with storing it also affect other sectors, such as industrial solutions. In both cases it is critical to meet additional quality requirements, such as data security, in addition to providing fast data access. These requirements may relate to security while still ensuring transparent, continuous access to the data.
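The knowledge-based management described above boils down to applying rules to each data set and choosing an action. Below is a minimal, illustrative sketch of such a rule-based policy in Python; the data-set attributes, thresholds, and action names are hypothetical and only stand in for the kind of rules an expert system might encode.

```python
# Minimal, illustrative sketch of a rule-based data-management policy.
# The attributes, thresholds, and action names are hypothetical.
from dataclasses import dataclass

@dataclass
class DataSet:
    name: str
    size_tb: float
    days_since_last_access: int
    replicas: int
    is_critical: bool

def decide_action(ds: DataSet) -> str:
    """Return a storage-management action for a data set based on simple rules."""
    # Rule 1: critical data must always have at least two replicas.
    if ds.is_critical and ds.replicas < 2:
        return "replicate"
    # Rule 2: non-critical data untouched for a year can be deleted.
    if not ds.is_critical and ds.days_since_last_access > 365:
        return "delete"
    # Rule 3: cold but critical data is moved to cheaper archive storage.
    if ds.days_since_last_access > 90:
        return "transfer_to_archive"
    # Default: leave hot data where it is.
    return "keep"

if __name__ == "__main__":
    experiments = [
        DataSet("detector_run_42", size_tb=120.0, days_since_last_access=200,
                replicas=1, is_critical=True),
        DataSet("scratch_results", size_tb=3.5, days_since_last_access=400,
                replicas=1, is_critical=False),
    ]
    for ds in experiments:
        print(ds.name, "->", decide_action(ds))
```

A fuzzy-logic variant would replace the hard thresholds with membership functions, but the overall decision structure would look much the same.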

AI Storage Requirements:

1. Extensibility:

Artificial intelligence systems must analyze massive quantities of data in a short period of time, and large data sets are necessary to produce accurate models. This data volume drives a considerable increase in storage requirements. To teach computers to talk, for example, Microsoft uses five years of continuous voice data; Tesla teaches vehicles to drive with 1.3 billion miles of driving data. Handling data volumes of this size requires a storage system that can expand indefinitely.
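To get a feel for the scale implied by "five years of continuous voice data", here is a back-of-the-envelope estimate in Python. The audio format assumed below (16 kHz, 16-bit, mono, uncompressed) is my own assumption for illustration; the article does not specify one.

```python
# Back-of-the-envelope estimate of the raw storage needed for five years of
# continuous voice data. The audio format (16 kHz, 16-bit, mono, uncompressed)
# is an assumption for illustration only.
SAMPLE_RATE_HZ = 16_000      # samples per second
BYTES_PER_SAMPLE = 2         # 16-bit audio
SECONDS_PER_YEAR = 365 * 24 * 3600
YEARS = 5

raw_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * SECONDS_PER_YEAR * YEARS
print(f"Raw audio: {raw_bytes / 1e12:.1f} TB")  # roughly 5 TB before labels,
                                                # features, and redundant copies
```

Even this modest format yields several terabytes of raw audio before labels, extracted features, and redundant copies are added; image and driving-telemetry data sets grow far faster.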

Cloud solution: object storage is the only form of storage that can scale indefinitely within a single namespace. Furthermore, because of its modular architecture, capacity can be expanded at any moment, so you can scale in response to demand rather than ahead of it.
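One common way object stores achieve this modular, expand-at-any-moment behaviour is consistent hashing: object keys are mapped onto a ring of storage nodes, so adding a node relocates only a fraction of the objects. The sketch below is a simplified illustration of that idea (no virtual nodes or replication), not the mechanism of any particular product.

```python
# Simplified consistent-hashing sketch: object keys are mapped onto a hash
# ring of storage nodes, so adding a node moves only a fraction of the keys.
# (Real object stores add virtual nodes and replication; omitted for brevity.)
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        h = _hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

keys = [f"experiment/file_{i}.dat" for i in range(10_000)]

before = HashRing(["node-1", "node-2", "node-3"])
after = HashRing(["node-1", "node-2", "node-3", "node-4"])  # capacity added later

moved = sum(before.node_for(k) != after.node_for(k) for k in keys)
print(f"{moved / len(keys):.1%} of objects move when one node is added")
```

Only about a quarter of the keys move when a fourth node joins, which is why capacity can be grown in place without re-ingesting the whole data set.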

2. Software-defined storage options:

Large data collections may call for hyperscale data centers with purpose-built server architectures, while other deployments benefit from the ease of use of pre-configured appliances. For this reason, many companies, such as Amazon, IBM, and Tesla, provide software as a service (SaaS) or software-defined storage.

3. Cost efficiency:

A usable storage system must be both scalable and affordable, two characteristics that rarely coexist in enterprise storage. Historically, massively scalable systems have been more expensive in terms of cost per unit of capacity. Large AI data sets are simply not feasible if they exceed the storage budget.

Cloud solution: this is why cloud providers offer platform as a service (PaaS), in which you are charged only for the time and resources you actually use. In this way organizations can reduce their operational costs.
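As a rough illustration of pay-per-use billing, the sketch below charges only for the compute hours and the capacity actually consumed. The rates are made-up placeholders, not any provider's real pricing.

```python
# Illustrative pay-per-use cost model for a PaaS-style service. The hourly and
# per-GB rates below are made-up placeholders, not any provider's real pricing.
COMPUTE_RATE_PER_HOUR = 0.50      # hypothetical $/hour for a training instance
STORAGE_RATE_PER_GB_MONTH = 0.02  # hypothetical $/GB-month for object storage

def monthly_cost(compute_hours: float, stored_gb: float) -> float:
    """Charge only for the hours and capacity actually used."""
    return compute_hours * COMPUTE_RATE_PER_HOUR + stored_gb * STORAGE_RATE_PER_GB_MONTH

# A team that trains for 120 hours and keeps 5 TB of data pays for exactly that:
print(f"${monthly_cost(compute_hours=120, stored_gb=5_000):,.2f} this month")
```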

4. Durability of data:

Backing up a multi-petabyte training data set is not always possible; it is frequently both expensive and time-consuming. But you also cannot leave it unprotected. Instead, the storage system must be self-protecting.

Cloud solution: the storage layer is designed with redundancy in mind, so data is preserved without the need for a separate backup procedure. You can also choose the level of protection required for each data type to optimize efficiency. Systems can be configured to withstand multiple node failures or even the loss of an entire data center.
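The effect of choosing a protection level per data type can be sketched with a simple replication model. Assuming each node fails independently with some probability (a deliberate simplification; real systems also use erasure coding and geographic dispersion), data is lost only if every replica is lost.

```python
# Minimal sketch of per-data-type protection levels. Assuming each node fails
# independently with probability p during some period, data is lost only if
# every replica is lost, i.e. with probability p ** replicas. Independence is
# a simplifying assumption; real systems also use erasure coding.
NODE_FAILURE_PROB = 0.01  # hypothetical per-node failure probability

protection_levels = {
    "raw_training_data": 3,   # three copies for irreplaceable data
    "derived_features": 2,    # cheaper protection for data that can be recomputed
    "scratch_output": 1,      # no redundancy for disposable results
}

for data_type, replicas in protection_levels.items():
    loss_prob = NODE_FAILURE_PROB ** replicas
    print(f"{data_type:20s} replicas={replicas}  P(loss) ~ {loss_prob:.0e}")
```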

5. Locality of data:

While some AI data will be stored in the cloud, much of it will be kept in the data center for a variety of reasons, including performance, cost, and regulatory compliance. To compete, on-premises storage must offer the same cost and scalability benefits as cloud-based storage. The cloud's native storage model is object storage; Cloudian (an AI/cloud storage company), for example, provides object storage solutions to a number of cloud providers for use as public cloud infrastructure. Cloud storage scalability and economics are now available on-premises.

6. Integrating the cloud:

Integration with the public cloud will remain an important requirement regardless of where data is stored, for two reasons. First, much of the AI/DL innovation is taking place in the cloud, and cloud-integrated on-premises systems offer the most flexibility in leveraging cloud-native capabilities. Second, as information is created and evaluated, we can expect a continuous flow of data to and from the cloud; an on-premises solution should simplify rather than restrict that flow. Cloudian is cloud-integrated in three ways. First, it uses the S3 API, the de facto standard language of cloud storage. Second, it enables tiering to the Amazon, Google, and Microsoft public clouds, and allows local and cloud-based data to be viewed within a single namespace. Third, data saved to the cloud via Cloudian is immediately accessible by cloud-based apps. With this bi-modal access, you can use cloud and on-premises resources interchangeably.
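Because the S3 API is the common language here, the same client code can talk to an S3-compatible on-premises store or to the public cloud, differing only in the endpoint it points at. The sketch below uses boto3; the endpoint URL, bucket name, and object key are placeholders, not real infrastructure.

```python
# Sketch of bi-modal access through the S3 API: the same client code talks to
# an S3-compatible on-premises store or to AWS S3, differing only in the
# endpoint. The endpoint URL, bucket name, and key below are placeholders.
import boto3

def make_client(on_premises: bool):
    if on_premises:
        # Hypothetical S3-compatible on-prem endpoint (e.g. a Cloudian-style store).
        return boto3.client("s3", endpoint_url="https://objectstore.example.internal")
    # Default endpoint: the AWS public cloud.
    return boto3.client("s3")

def upload_training_shard(client, path: str):
    with open(path, "rb") as f:
        client.put_object(Bucket="ai-training-data", Key=path, Body=f)

if __name__ == "__main__":
    s3 = make_client(on_premises=True)
    upload_training_shard(s3, "shards/batch_0001.tfrecord")
```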

AI stages and I/O requirements:

AI's storage and I/O requirements change over its lifecycle. Conventional AI systems require training, and during that phase they are more I/O-intensive, allowing them to take advantage of flash and NVMe (Non-Volatile Memory Express). The inference stage, in contrast, relies more on compute resources. Deep learning systems, because of their capacity to retrain themselves as they operate, require continual data availability. "When some companies talk about storage for machine learning/deep learning, they frequently mean model training, which requires extremely high bandwidth to keep the GPUs busy," explains Doug O'Flaherty, director at IBM Storage. However, managing the full AI data pipeline from ingest through inference is where a data science team will see the most productivity gains.
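On the training side, "keeping the GPUs busy" usually means overlapping storage reads with computation. The sketch below shows that pattern with a PyTorch DataLoader; PyTorch and the specific loader settings are assumptions for illustration, since the article does not name a framework.

```python
# Minimal PyTorch sketch of the training-side I/O pattern: several loader
# workers read and decode data in parallel, with pinned memory, so storage
# bandwidth rather than single-threaded reads feeds the GPU.
# PyTorch is an assumption here; the article does not name a framework.
import torch
from torch.utils.data import DataLoader, Dataset

class RandomShards(Dataset):
    """Stand-in dataset; a real pipeline would read shards from object storage."""
    def __len__(self):
        return 10_000

    def __getitem__(self, idx):
        return torch.randn(3, 224, 224), idx % 1000

if __name__ == "__main__":
    loader = DataLoader(
        RandomShards(),
        batch_size=256,
        num_workers=8,        # parallel readers keep the I/O path busy
        pin_memory=True,      # faster host-to-GPU copies
        prefetch_factor=2,    # each worker queues batches ahead of the GPU
    )
    for images, labels in loader:
        pass  # the training step (forward/backward) would go here
```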

The outputs of an AI application, on the other hand, are frequently small enough that they pose no problem for current business IT systems. This implies that AI systems require layers of storage, much like traditional data science or even enterprise resource planning (ERP) and application servers. As a result, less GPU-intensive applications may be good candidates for the cloud. Google, for example, has created AI-specific processors that are compatible with its infrastructure. However, as IBM's O'Flaherty cautions, given the technological and budgetary restrictions, the cloud is more likely to assist AI than to be at its core for the time being.

Architecture for AI storage:

Building a bigger business around AI has been a primary objective for IBM in recent years, especially under new CEO Arvind Krishna, right alongside its push into the cloud, enabled in part by its $34 billion acquisition of Red Hat in 2019. IBM is immersed in AI projects across most parts of the company; consider, for example, what it is doing with Watson technology and IBM Cloud Pak for Data, an integrated data and AI platform.

According to Eric Herzog, vice president and chief marketing officer of worldwide storage channels for IBM Storage, the new storage offerings help build an architecture that can support the complex AI- and analytics-optimized workloads enterprises are grappling with. To run such workloads efficiently, organizations must be able to gather, manage, and analyze data, and then use the resulting insight to speed business decisions and product and service development. This entails guaranteeing access to all required data sources, as well as maintaining and analyzing the data collaboratively, regardless of where it is located.

For example, the firm revealed the Elastic Storage System (ESS) 5000, an all-hard-drive array intended for data lakes and optimized for data collection and long-term capacity. The 2U system supplements the ESS 3000, an all-flash NVMe array released in October 2019 that is similarly geared toward AI and analytics workloads, while offering lower cost and higher density.


IBM also updated its Cloud Object Storage (COS), improving performance to 55 GB/s in a 12-node configuration, with read and write improvements of 300 percent and 150 percent, respectively, depending on object size. COS also supports Shingled Magnetic Recording (SMR) drives, high-capacity disk drives that allow 1.9 PB of capacity in a 4U disk enclosure. All of this helps with integration into high-performance AI environments.
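A quick calculation puts the quoted figures in perspective: 55 GB/s aggregate across 12 nodes, and the time it would take to stream a hypothetical 1 PB data set at that rate.

```python
# Quick arithmetic with the figures quoted above: 55 GB/s aggregate across a
# 12-node configuration, and the time to stream a 1 PB data set at that rate.
AGGREGATE_GBPS = 55          # GB/s, 12-node COS configuration cited above
NODES = 12
DATASET_PB = 1.0             # illustrative data-lake size, not from the article

per_node = AGGREGATE_GBPS / NODES
seconds = DATASET_PB * 1_000_000 / AGGREGATE_GBPS   # 1 PB = 1,000,000 GB (decimal)
print(f"~{per_node:.1f} GB/s per node; a 1 PB scan takes ~{seconds / 3600:.1f} hours")
```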

Conclusion:

We have looked at the features, requirements, and architecture of AI storage. In the near future, the idea of keeping data only on local physical devices may appear absurd. It is impossible to overlook the numerous advantages of cloud storage, such as flexibility, backup, portability, connectivity, and scalability. Cloud storage is the preferred approach for companies expanding their infrastructure or developing new technologies, because it is the most effective way to distribute applications.

References:

1. https://www.computerweekly.com/feature/AI-storage-Machine-learning-deep-learning-and-storage-needs

2. https://www.nextplatform.com/2020/07/13/an-architecture-for-artificial-intelligence-storage/

3. https://cloudian.com/resource/data-sheets/eight-storage-requirements-artificial-intelligence-deep-learning/

4. https://www.calsoftinc.com/blogs/2020/08/how-ai-is-reshaping-storage-technologies.html
