Cloud 101

The Top 5 Data Storage Requirements for AI

Are you exploring artificial intelligence (AI) and wondering how its demands will impact your IT infrastructure? It is widely recognized that the generative AI (GenAI) technology stack and AI workloads, such as model training and natural language analysis, require processing large amounts of data. What often receives less attention is where all that data will be stored. 

This article examines the role of data storage in driving AI success and outlines the top five data storage requirements for AI. 

The role of storage in the GenAI stack 

In the GenAI technology stack, storage plays a crucial role in enabling high-performance, scalable, and economical AI workloads. The GenAI tech stack comprises multiple elements, including ML frameworks such as TensorFlow and PyTorch, machine learning (ML) models that train on massive datasets, data processing tools, and data storage. Storage supports AI workloads, such as model training and predictive analytics.  

The data storage that supports the GenAI stack should be optimized for AI. While it is possible to use traditional storage for AI, it’s not ideal. AI workloads simply handle too much data. Legacy storage solutions cannot handle the volume, variety, and growth rate of AI data. AI storage is designed to manage the immense data flows created by the AI stack. Highly available and fault-tolerant, AI storage reduces inefficiencies, scales as needed, and performs at the speed required by AI servers. 

1. Data volume and types 

AI and ML models require large volumes of diverse data. The specifics will depend on the type of model. An ML model that deals with images will need image data. A model that generates text will need text data, and so forth. Usually, the data for the model includes both structured and unstructured data. Structured data is stored in databases and arranged in columns and rows. Unstructured data covers everything else, like documents, images, emails, and social media comments. 

The data for the AI/ML models undergoes a four-stage lifecycle, which has an impact on storage: 

  • Collecting and storing raw input data — This stage involves assembling data sets relevant to the model, in raw, unprocessed form. The raw data tends to be messy and incomplete in its native formats. 

  • Refining the raw input data into training data — The ML platform’s data processing pipeline cleans and transforms the raw input data into a format that the ML algorithm can use to learn patterns and relationships between data points. For example, refining reviews on social media might involve assigning “sentiment labels” such as “positive” or “negative.” 

  • Assigning model weights — During training, the ML platform adjusts the model’s weights (its internal parameters) as it learns patterns from the training data. 

  • Generating inference logs — These records capture the inputs the trained model receives and the predictions (inferences) it produces. Inference is what enables the model to function in the real world. 
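The refinement stage described above can be sketched in Python. This is an illustrative toy pipeline: the cleaning rules and sentiment keyword lists are simplified assumptions, not a real ML preprocessing framework.

```python
# Illustrative sketch of the "refine raw input into training data" stage.
# The keyword lists and cleaning rules below are simplified assumptions.

def refine(raw_reviews):
    """Clean raw social media reviews and attach sentiment labels."""
    positive = {"great", "love", "excellent"}
    negative = {"bad", "terrible", "hate"}
    training_data = []
    for text in raw_reviews:
        cleaned = text.strip().lower()
        if not cleaned:          # drop empty or unusable records
            continue
        words = set(cleaned.split())
        if words & positive:
            label = "positive"
        elif words & negative:
            label = "negative"
        else:
            label = "neutral"
        training_data.append({"text": cleaned, "label": label})
    return training_data

examples = refine(["  Great product, love it! ", "", "Terrible support"])
```

A real pipeline would also handle deduplication, tokenization, and format conversion, but the shape is the same: messy raw input goes in, labeled training examples come out.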

The ML data lifecycle requires significant data storage capabilities. The raw input data can be quite voluminous and must generally be kept separate from production data. In other words, it’s a full-sized copy of your data. Then, the refined training data and associated metadata, such as weights and labels, can roughly triple the load created by the raw inputs alone.  

The amount of data required for raw input will depend on the nature of the ML training task. As a reference point, training an ML model to recognize the subject of an image, such as “This is a picture of a cat,” typically requires around 1,000 images; however, some models may need significantly more images to succeed. At two megabytes per image and a need to train the model on multiple subjects, the amount of data written into storage can quickly become enormous.

With text-based training, such as for large language models (LLMs), the input text data requires less storage space, approximately two kilobytes per example, but there may be millions of texts for the model to parse. 
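The back-of-the-envelope math behind these figures adds up quickly. In the sketch below, the per-item sizes come from the text above; the number of subjects and the corpus size are hypothetical examples.

```python
# Rough storage math for the figures cited above.
MB = 1024 ** 2
KB = 1024

# Image model: ~1,000 images per subject at ~2 MB each.
images_per_subject = 1_000
image_size = 2 * MB
subjects = 500                       # hypothetical number of subjects
image_dataset = images_per_subject * image_size * subjects
print(image_dataset / 1024 ** 4)     # just under 1 TiB of raw images

# Text model: millions of examples at ~2 KB each.
text_examples = 10_000_000           # hypothetical corpus size
text_dataset = text_examples * 2 * KB
print(text_dataset / 1024 ** 3)      # roughly 19 GiB of raw text
```

And that is before counting the refined copies and metadata described earlier, which multiply the footprint further.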

2. Performance 

AI storage should perform with high throughput and low latency.

High throughput allows the model to read/write large datasets quickly. Low latency enables the model to operate at real-time speed. These two qualities reduce bottlenecks in processing that leave processors idle and underutilized. Without high throughput and low latency, the AI training process can be prolonged, which increases costs. If AI storage works effectively, it can bypass system memory through direct data paths to the graphics processing unit (GPU), further enhancing the model’s performance. 

Sequential and random data access patterns also characterize high-performing AI storage. The two patterns are opposite, but both are necessary.

Sequential access means reading or writing data in a continuous, linear order. This approach to data input/output is typically faster than alternatives because it avoids operations that “seek” data and waste time. Sequential access also helps expedite processes like data ingestion and processing, which are integral to AI/ML. 

Conversely, random access enables the AI system to access any data point directly, without needing to read preceding data points. This capability is important in AI because models often shuffle their training data to avoid “overfitting” to the order of examples. With random access, this shuffling does not slow down the training process. In the inference stage, random access enables rapid access to non-contiguous data points. 
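The contrast between the two access patterns can be sketched as follows. The dataset here is a stand-in list rather than a real storage layer, but it shows why both orderings matter.

```python
import random

# Toy illustration of sequential vs. random data access patterns.
dataset = [f"record-{i}" for i in range(8)]

# Sequential access: read records in continuous, linear order,
# as during bulk ingestion or streaming writes.
sequential_order = list(range(len(dataset)))
sequential_epoch = [dataset[i] for i in sequential_order]

# Random access: shuffle the read order each training epoch so the
# model never sees examples in the same sequence (reduces overfitting).
rng = random.Random(42)              # fixed seed for reproducibility
random_order = list(range(len(dataset)))
rng.shuffle(random_order)
shuffled_epoch = [dataset[i] for i in random_order]
```

On storage that handles random reads poorly, the second pattern becomes the bottleneck; storage optimized for AI must serve both efficiently.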

3. Scalability 

AI data sets inevitably grow larger over time, sometimes in significant, rapid leaps. Use cases change, and organizations want to try new AI processes. This dynamic results in more data for model training and ongoing AI functionality. AI storage must be highly scalable and user-friendly to support these needs. Global AI data projections confirm this expectation. According to IDC, AI data center storage capacity is expected to grow from 10.1 zettabytes (ZB) in 2023 to 21 ZB in 2027. That represents a compound annual growth rate of over 18.5%.  
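The growth rate quoted above follows from the standard compound-annual-growth formula applied to the IDC figures:

```python
# CAGR implied by the IDC projection cited above:
# 10.1 ZB in 2023 growing to 21 ZB in 2027 (four compounding years).
start_zb, end_zb = 10.1, 21.0
years = 2027 - 2023
cagr = (end_zb / start_zb) ** (1 / years) - 1
print(f"{cagr:.1%}")   # roughly 20%, consistent with "over 18.5%"
```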

AI storage must also adapt to evolving data types and workloads. For instance, a company might begin its AI journey with a relatively simple use case, such as AI chatbots to assist with customer support. That is not a heavy load for storage. However, the company may then want to apply AI to safety camera videos from factories, which generate extremely large volumes of video data. Storage needs to keep pace without compromising performance or incurring disproportionate cost increases.  

4. Security 

Don’t overlook the potential security risk exposure inherent in accumulating data for AI. Malicious actors will attempt to breach data repositories holding AI/ML training data because AI model data is a valuable asset. The data is also an inviting target because it may not be subject to the same controls that protect data elsewhere in infrastructure. AI storage must play its part in defense by supporting all appropriate controls and countermeasures. The same applies to compliance with privacy laws and similar regulations. 

In parallel, AI storage must protect the AI dataset from attacks that could compromise its integrity or availability. For example, a ransomware attack could permanently stop an AI solution from working.

Protection against these attacks includes enabling strong encryption and controls such as immutable storage buckets, which prevent anyone from modifying or deleting AI data for a defined retention period. You can also add protections on top of immutability with the newest feature from Wasabi, Covert Copy, which effectively prevents ransomware attacks and data theft by creating a hidden, virtually air-gapped copy of your data. Wasabi also offers additional storage security features, such as Multi-User Authentication, which requires multiple individual users to confirm actions like deleting a file or account. 
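In S3-compatible object storage, immutability is typically expressed as an Object Lock configuration. The sketch below builds one in the shape boto3’s `put_object_lock_configuration` call expects; the bucket name and retention period are hypothetical, and no request is actually sent.

```python
# Sketch of an S3 Object Lock (immutability) configuration, in the shape
# boto3's put_object_lock_configuration expects. Retention period and
# bucket name are hypothetical examples; no network call is made here.

def object_lock_config(mode="COMPLIANCE", days=30):
    """Build an Object Lock rule: objects cannot be modified or deleted
    until the retention period expires."""
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {
            "DefaultRetention": {"Mode": mode, "Days": days},
        },
    }

config = object_lock_config(days=90)

# With boto3, this would be applied along the lines of:
# s3.put_object_lock_configuration(
#     Bucket="ai-training-data",        # hypothetical bucket
#     ObjectLockConfiguration=config,
# )
```

In COMPLIANCE mode, not even an account administrator can shorten the retention window, which is what makes the data resistant to ransomware-style deletion.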

5. Cost efficiency 

Cost management may not currently be the highest priority for companies trying to optimize AI workflows, but it soon will be. As the parameters of AI’s return on investment (ROI) become better understood, companies will focus on costs. Given the large and growing AI datasets, storage costs will come under immediate scrutiny. 

AI storage needs to be cost-effective. AI cloud storage should be reasonably and predictably priced. The right storage platform for AI workloads does not charge fees for routine functions, such as API access, data retrieval, and data egress. Such fees, which are common with most cloud storage platforms, add significant, variable, and unexpected costs to AI storage budgets. 

Wasabi cloud object storage for AI 

Wasabi cloud object storage for AI keeps your AI pipeline efficient, secure, and cost-effective throughout the AI data lifecycle. From initial data ingestion to long-term AI model retention, Wasabi cloud object storage for AI offers a high-performance, cost-effective solution specifically designed for AI workloads. It maintains both structured and unstructured datasets for future reuse, comparison, and compliance—capabilities that are essential for AI workflows. 

Data ingestion and inference work at high speed with Wasabi cloud object storage for AI, facilitating rapid training cycles with low latency. The platform delivers scalability, up or down on demand, to adapt to evolving needs and workloads. Single-tier storage pricing eliminates the complexity of tiering and avoids the potential for hidden fees when accessing your data. Its S3-compatible API enables seamless integration with leading AI platforms, ML tools, and data management solutions.  
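Because the API is S3-compatible, existing S3 tooling works with minimal changes. The sketch below shows the general pattern with boto3; the endpoint URL, bucket, key layout, and credentials are placeholders to replace with your own region and keys.

```python
# Sketch of connecting to an S3-compatible endpoint with boto3.
# Endpoint, bucket, key layout, and credentials are placeholders.

def training_key(stage, name):
    """Hypothetical key layout separating lifecycle stages in one bucket."""
    return f"{stage}/{name}"

def make_client(endpoint="https://s3.wasabisys.com"):
    # boto3 import kept local so the helper above runs without it installed.
    import boto3  # assumes boto3 is available in your environment
    return boto3.client(
        "s3",
        endpoint_url=endpoint,
        aws_access_key_id="YOUR-ACCESS-KEY",      # placeholder
        aws_secret_access_key="YOUR-SECRET-KEY",  # placeholder
    )

# Usage (not executed here):
# s3 = make_client()
# s3.upload_file("cat-0001.jpg", "ai-training-data",
#                training_key("raw", "cat-0001.jpg"))

key = training_key("raw", "cat-0001.jpg")
```

Keeping raw, refined, and inference-log objects under distinct key prefixes makes each lifecycle stage easy to list, replicate, or expire independently.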
