Security and Compliance Concerns for AI Datasets
As businesses develop new AI tools and workflows, it’s imperative to understand how security and compliance factor into the many ways data is used to power this innovation. The algorithms that underlie AI systems are designed to identify patterns and trends. This could be anything from an understanding of language and facts used by generative AI models to key differences between legitimate files and potential malware.
When training AI systems, it’s essential to ensure that they’re learning from the right types of data. Depending on the type of model you’re creating, that data could be highly sensitive and subject to certain security and privacy standards. Beyond checks to ensure training data is high-quality and accurate, considerations around regulatory compliance and data security should be prioritized to minimize business risks.
Compliant data makes healthy models
While identifying underlying patterns and trends during AI model training, a model can also absorb specific information and regurgitate it later with unintended effects. For example, the text of major historical documents is often flagged as AI-generated by detection tools because it appears many times in training data, and GenAI systems learn its exact phrasing as a common “pattern” in the language.
This ability to absorb specific information and provide it in response to prompts becomes problematic if an AI system is trained on sensitive data, such as:
Personally identifiable information (PII)
Intellectual property
Business secrets
Confidential information
In these cases, an AI may use sensitive or protected information in its responses. You wouldn’t want an AI trained on health records to expose a specific patient’s information, or an AI trained as a corporate assistant to reveal company secrets to anyone who prompts it. Since many modern AI models are non-explainable systems, it’s infeasible to determine whether they contain sensitive information that the right prompt could cause them to leak. As a result, it’s vital to scrub sensitive information from potential training data before it reaches the training stage.
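As an illustration, a pre-training scrubbing pass might look like the Python sketch below. It uses simple regular expressions to redact a few common PII formats; the patterns and the scrub function are illustrative only, and production pipelines typically rely on dedicated PII-detection tooling such as NER models or cloud DLP services.

```python
# A minimal sketch of pre-training PII scrubbing using regular expressions.
# The patterns below are illustrative, not exhaustive.
import re

# Hypothetical patterns for common PII types
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace detected PII with typed placeholders before the text
    is allowed into a training corpus."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

record = "Contact Jane at jane.doe@example.com or 555-867-5309. SSN: 123-45-6789."
print(scrub(record))
# -> "Contact Jane at [EMAIL] or [PHONE]. SSN: [SSN]."
```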
Compliance monitoring
AI training data sets are often massive, making the cloud the logical choice for hosting them. To store training data there, organizations must implement security controls to comply with:
Data privacy laws: If AI training data sets include PII, they must be protected in compliance with GDPR and other applicable regulations. For example, regulations will likely require data encryption and least privilege access management for these data sets (a sketch of such a least-privilege policy follows this list).
Industry standards: Depending on the industry and the type of data in question, additional regulations may apply. For example, the Payment Card Industry Data Security Standard (PCI DSS) mandates certain security controls for payment cardholders’ personal data.
AI regulations: As AI becomes ubiquitous, various jurisdictions have implemented regulations managing the use of AI. Data security and regulatory compliance strategies for AI training data must also abide by these regulations.
Internal policies: Companies are increasingly adopting internal governance policies to manage their use of AI and exposure to the associated risks. Cloud environments need to ensure adequate data security and visibility to meet these requirements.
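To make the least privilege requirement concrete, here is a minimal sketch of an S3-style read-only access policy scoped to a single training-data bucket. The bucket name and policy scope are assumptions for illustration; an actual policy would be tailored to the organization’s storage layout and regulatory obligations.

```python
# A minimal sketch of a least-privilege, S3-style access policy scoped to a
# single (hypothetical) training-data bucket. A role attached to this policy
# can read training data but cannot modify, delete, or reach any other bucket.
import json

TRAINING_BUCKET = "example-ai-training-data"  # hypothetical bucket name

read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadTrainingDataOnly",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{TRAINING_BUCKET}",
                f"arn:aws:s3:::{TRAINING_BUCKET}/*",
            ],
        }
    ],
}

print(json.dumps(read_only_policy, indent=2))
```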
AI security standards to know
Several jurisdictions have already implemented AI security and safety standards, and others are currently in the works. The most significant of these is the EU Artificial Intelligence Act, which mandates risk assessments and data governance strategies for high-risk AI systems to protect against potential security risks, misuse, and bias.
Numerous frameworks have also been implemented to aid organizations in developing secure, trustworthy AI systems. Some examples include the NIST AI Risk Management Framework (RMF) in the US and the ENISA Framework for AI Cybersecurity Practices (FAICP) in the EU.
Protection of sensitive information
The data security risks of AI aren’t limited to sensitive records embedded in training sets. Employees using external AI models, such as ChatGPT, shouldn’t provide any confidential or sensitive information to them either.
Many of these services use information from users’ prompts to refine and enhance their models. As a result, information provided in one employee’s prompt could surface in the tool’s response to another user, breaking confidentiality.
To manage this risk, organizations should implement policies and training that inform employees about the risks of AI usage and the associated best practices. For example, when an AI model must process sensitive data, tokenization can conceal the actual values so that the information provided has no exploitable value to another user, as in the sketch below.
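The following Python sketch shows the basic idea: sensitive values are replaced with random tokens before a prompt leaves the organization, and the token-to-value mapping stays in a local vault. The Tokenizer class and token format are hypothetical; production deployments typically use a dedicated tokenization or vault service.

```python
# A minimal sketch of tokenization before data reaches an external AI tool.
# Sensitive values are swapped for random tokens; the mapping (the "vault")
# never leaves the organization, so the prompt has no exploitable value.
import secrets

class Tokenizer:
    def __init__(self):
        self._vault: dict[str, str] = {}  # token -> original value

    def tokenize(self, value: str) -> str:
        token = f"TOK_{secrets.token_hex(8)}"
        self._vault[token] = value
        return token

    def detokenize(self, token: str) -> str:
        return self._vault[token]

tok = Tokenizer()
account = tok.tokenize("4111-1111-1111-1111")  # e.g., a card number
prompt = f"Summarize recent activity for account {account}."
# `prompt` is now safe to send to an external model; only the local
# vault can map TOK_... back to the real account number.
print(prompt)
print(tok.detokenize(account))
```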
AI data security
AI training data faces various potential threats, such as data leaks and adversarial attacks. Implementing data security best practices is vital to manage these risks and achieve compliance with applicable regulations. Key elements of an AI data security strategy include:
Governance: Visibility and control over AI usage and training data are essential to protect against misuse and various data security threats. Organizations should have policies and controls in place to manage access to training data and ensure data integrity and compliance.
Data encryption: AI training data commonly includes sensitive and valuable information, making it a prime target for attack. Encrypting data at rest and in transit helps limit access to authorized users (an at-rest encryption sketch follows this list).
Strong authentication: Attackers may attempt to access AI training data to steal, poison, or delete it. Multi-factor authentication (MFA) and multi-user authentication reduce this risk by requiring multiple authentication factors and/or the consent of multiple authorized parties before AI training data can be manipulated.
Least privilege access: Insider threats and compromised accounts can be used to target AI training data. Least privilege access controls limit users’ and applications’ access to the minimum necessary, reducing the damage that a malicious insider or compromised account can do.
AI-enhanced monitoring: By definition, a cyberattack requires an attacker to take unusual and malicious actions to harm the business. Implementing AI-enhanced behavioral analytics can help an organization to identify anomalous actions that may point to a potential attack.
Data labeling: AI training data should be clearly labeled with its purpose and associated level of sensitivity. This aids in aligning data security controls to associated policies and standards.
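As a concrete illustration of the encryption and labeling items above, the sketch below encrypts a training-data shard at rest and writes a sidecar sensitivity label. It assumes the third-party cryptography package (pip install cryptography); the filenames, label fields, and key handling are illustrative, and a real deployment would keep keys in a KMS or HSM.

```python
# A minimal sketch of encrypting a training-data file at rest and attaching
# a sensitivity label before it is stored. Names and labels are illustrative.
import json
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in production, keep this in a KMS/HSM
fernet = Fernet(key)

raw = b"patient_id,diagnosis\n1001,hypertension\n"
encrypted = fernet.encrypt(raw)

with open("training_shard_0001.bin", "wb") as f:
    f.write(encrypted)

# Sidecar label aligning the object with policy (see "Data labeling" above)
label = {"purpose": "model-training", "sensitivity": "restricted-PHI"}
with open("training_shard_0001.meta.json", "w") as f:
    json.dump(label, f)

# Only holders of the key can recover the plaintext
assert fernet.decrypt(encrypted) == raw
```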
Securing AI training data in the cloud
An AI model is only as good as its training data, which means high-quality models require high-quality — and potentially sensitive — training data. This introduces various data security and regulatory compliance risks as companies work to balance the benefits of using an AI model with the need to protect their training data from potential theft or corruption.
Due to the significant storage requirements of AI training data, cloud storage is the logical host for this data, making cloud security a key component of an organization’s AI security strategy. Companies should look for AI data storage solutions that are not only cost-effective but also offer the encryption, access management, and data visibility and control capabilities that they need.