Securing AI · 4 min read

Generative AI: The Hidden Data Security Minefield

GenAI DataEnv by Philip Dursey and leonardo.ai, the AI Security Pro human machine (rendering) team

Generative AI technologies are fascinating for their outputs, such as realistic images and human-like text, but they pose significant security risks on the input side. The data pipeline feeding these models is fraught with vulnerabilities, making each stage a potential target for security breaches. For security professionals, addressing these risks is crucial.

Securing Data Collection for Generative AI

Data collection is often the most vulnerable stage in the AI data pipeline. Sources include web scraping, IoT devices, and confidential business processes.

Data Integrity and Secure Collection: Ensuring data integrity at the point of collection is critical. Homomorphic encryption, for instance, allows computations on encrypted data without decryption, safeguarding sensitive information; it is computationally intensive, however, and requires careful implementation planning [1]. Google has implemented Zero Trust Architecture (ZTA), continuously authenticating and authorizing every data source and collector to ensure secure data transmission across its networks.
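
To make the idea concrete, here is a minimal sketch of additively homomorphic encryption using the open-source python-paillier (phe) package; the package choice and the toy telemetry values are illustrative assumptions, not a description of any vendor's pipeline.

```python
# pip install phe  (python-paillier, an additively homomorphic cryptosystem)
from phe import paillier

# A collector encrypts readings at the edge; only the key holder can decrypt.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

readings = [72.5, 68.1, 75.3]  # hypothetical plaintext telemetry
encrypted = [public_key.encrypt(r) for r in readings]

# An untrusted aggregator can sum ciphertexts without ever decrypting them.
encrypted_total = sum(encrypted[1:], encrypted[0])

# Raw readings never appear in plaintext outside the trust boundary.
print(private_key.decrypt(encrypted_total))  # ~215.9
```

Ciphertext addition and scalar multiplication cover many aggregation-style workloads, but at a steep cost relative to plaintext arithmetic, which is why deployment planning matters.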

Data Preprocessing

Preprocessing involves cleaning, normalizing, and transforming data for model training, a stage prone to introducing biases and data leakage.

Anonymization: Implementing rigorous data anonymization is essential. Differential privacy techniques, such as those used by Uber, add calibrated noise to data, preventing re-identification of individuals while maintaining data utility [2]. Techniques like named entity recognition (NER) and adversarial de-identification are also employed to keep sensitive information from being reverse-engineered.
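
As a minimal sketch of the core mechanism, the snippet below releases a count with Laplace noise calibrated to the query's sensitivity and a chosen epsilon; the dataset size and parameters are illustrative assumptions, and production systems (Uber's included) add budget tracking and composition accounting on top.

```python
import numpy as np

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with epsilon-differential privacy via the Laplace mechanism.

    sensitivity: the most one individual's record can change the count (1 here).
    epsilon: privacy budget; smaller epsilon means more noise, stronger privacy.
    """
    scale = sensitivity / epsilon
    return true_count + np.random.laplace(loc=0.0, scale=scale)

true_count = 1_204  # hypothetical: users who opted in
print(dp_count(true_count, epsilon=0.5))  # e.g. 1206.7; each run differs
```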

Data Storage

Storing the large datasets needed for AI training carries significant risk: these data troves are prime targets for attackers.

Advanced Storage Security: Data sharding with distributed encryption keys ensures that no single breach compromises the entire dataset. Dropbox, for instance, uses Data Security Posture Management (DSPM) tools to monitor and secure data, providing real-time insight into potential vulnerabilities [3]. Hardware security modules (HSMs) further protect the encryption keys themselves.
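
A minimal sketch of the sharding pattern, assuming the cryptography package's Fernet primitive: each shard is encrypted under its own key, so the loss of one key or one store exposes only that shard. The shard count and in-memory key list are illustrative; in practice the keys would live in separate HSMs or a KMS.

```python
# pip install cryptography
from cryptography.fernet import Fernet

def shard_and_encrypt(data: bytes, n_shards: int = 4):
    """Split data into shards, each encrypted under its own key."""
    size = -(-len(data) // n_shards)  # ceiling division
    shards, keys = [], []
    for i in range(n_shards):
        key = Fernet.generate_key()           # one key per shard
        chunk = data[i * size:(i + 1) * size]
        shards.append(Fernet(key).encrypt(chunk))
        keys.append(key)                      # stored separately (HSM/KMS in practice)
    return shards, keys

def decrypt_and_join(shards, keys) -> bytes:
    return b"".join(Fernet(k).decrypt(s) for s, k in zip(shards, keys))

shards, keys = shard_and_encrypt(b"training-corpus-bytes...")
assert decrypt_and_join(shards, keys) == b"training-corpus-bytes..."
```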

Data Transmission

Data in transit is particularly vulnerable. While end-to-end encryption is standard, evolving threats require more advanced measures.

Future-Proofing Encryption: As quantum computing advances, adopting quantum-resistant algorithms such as lattice-based cryptography becomes crucial; American Binary and IBM are actively developing these technologies to protect against future quantum threats. Additionally, secure multiparty computation (SMPC) protocols enable collaborative computation without exposing individual data inputs [4][5][6], as demonstrated by companies like Google in federated learning projects.
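
The flavor of SMPC most relevant to federated learning is secure aggregation. Below is a minimal sketch of additive secret sharing over a prime field, the building block behind such protocols; the party values and modulus are illustrative assumptions, and real protocols add dropout handling and authenticated channels.

```python
import secrets

PRIME = 2**61 - 1  # field modulus (illustrative; real protocols fix their own)

def share(value: int, n_parties: int):
    """Split value into n additive shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

# Three hypothetical parties each secret-share a private model-update value.
private_values = [42, 17, 99]
all_shares = [share(v, n_parties=3) for v in private_values]

# Each party sums the shares it receives, one from every participant.
partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]

# Combining partial sums reveals only the aggregate, never individual inputs.
assert sum(partial_sums) % PRIME == sum(private_values) % PRIME  # 158
```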

Data Provenance

Understanding the origin and transformation history of data is crucial for both security and regulatory compliance.

Robust Data Lineage Systems: Financial institutions, for example, use data lineage systems to comply with regulations, ensuring transparency and traceability in data handling [7][8][9]. Such systems log every data transformation and access, providing a comprehensive, verifiable audit trail.
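
A minimal sketch of one way to make such a trail tamper-evident: each log entry's hash incorporates the previous entry's hash, so any retroactive edit breaks the chain. The event fields are illustrative assumptions; production lineage systems track far richer metadata.

```python
import hashlib
import json
import time

class LineageLog:
    """Append-only log where each record chains to the previous record's hash."""

    def __init__(self):
        self.entries = []
        self._last_hash = "0" * 64  # genesis value

    def record(self, dataset: str, actor: str, action: str):
        entry = {
            "ts": time.time(),
            "dataset": dataset,
            "actor": actor,
            "action": action,          # e.g. "ingest", "normalize", "read"
            "prev_hash": self._last_hash,
        }
        payload = json.dumps(entry, sort_keys=True).encode()
        entry["hash"] = hashlib.sha256(payload).hexdigest()
        self._last_hash = entry["hash"]
        self.entries.append(entry)

    def verify(self) -> bool:
        """Recompute the chain; a tampered entry invalidates everything after it."""
        prev = "0" * 64
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            if e["prev_hash"] != prev:
                return False
            if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != e["hash"]:
                return False
            prev = e["hash"]
        return True

log = LineageLog()
log.record("customers.parquet", "etl-service", "ingest")
log.record("customers.parquet", "ml-pipeline", "normalize")
assert log.verify()
```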

Training Data Management

Managing training data involves not only ensuring model quality but also maintaining security.

Access Controls and Anomaly Detection: Strict access controls, based on the principle of least privilege, are essential. Microsoft, for example, uses behavioral analytics to detect unusual data-access patterns, signaling potential insider threats or account compromise. In federated learning, secure aggregation protocols ensure that global model updates do not leak individual data contributions. Additionally, synthetic data generated by GANs provides a safe way to train models when real data is limited or sensitive [10][11][12]. A simple version of the access-pattern idea is sketched below.
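
This is a minimal sketch of the behavioral-analytics idea: flag a user whose daily access count deviates sharply from their own baseline. The z-score threshold and access counts are illustrative assumptions; real systems model many more signals than raw counts.

```python
import statistics

def is_anomalous(history: list[int], today: int, threshold: float = 3.0) -> bool:
    """Flag today's access count if it sits more than `threshold`
    standard deviations above the user's historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) or 1.0  # guard against zero variance
    return (today - mean) / stdev > threshold

# Hypothetical: an analyst normally reads a few dozen records a day...
baseline = [31, 28, 35, 30, 33, 29, 34]
print(is_anomalous(baseline, today=32))     # False: within normal range
print(is_anomalous(baseline, today=4_200))  # True: possible exfiltration or compromise
```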

Balancing Data Accessibility and Security: There is a critical need to balance data accessibility for innovation with stringent security measures. This balance is evident in academic research, where data must be protected without hindering scientific progress.

Implementation and Resistance: Implementing these measures requires a shift in organizational thinking about data infrastructure. A thorough audit of the AI data pipeline is necessary, focusing initially on robust data provenance and access controls. This foundational step can lead to significant security improvements in data preprocessing and storage.

Resistance is likely, particularly from researchers and business units seeking quick results. However, the goal is to facilitate innovation while safeguarding critical assets.

The future of AI security depends not just on protecting models but on securing the entire data ecosystem. As generative AI becomes more prevalent, the value of underlying data will grow, making comprehensive data security an imperative for organizations seeking a competitive edge.

Where to Start: Begin by auditing your most sensitive AI projects, tracing data back to its source. This approach defines your new attack surface, highlighting the importance of securing the entire data journey.


References:

[1] Huang, J., Deng, Y., & Liu, Y. (2022). Practical solutions in fully homomorphic encryption: a survey analyzing existing acceleration methods. Cybersecurity, 5 (1), 1-19.

[2] Dwork, C., McSherry, F., Nissim, K., & Smith, A. (2006). Calibrating Noise to Sensitivity in Private Data Analysis. In S. Halevi & T. Rabin (Eds.), Theory of Cryptography (Vol. 3876, pp. 265-284). Springer, Heidelberg. doi: [10.1007/11681878_14](https://doi.org/10.1007/11681878_14).

[3] Cloud Security Alliance. "Guide to Data Security Posture Management (DSPM)." Cloud Security Alliance. Available at: [cloudsecurityalliance.org](https://cloudsecurityalliance.org/blog/2021/12/07/guide-to-data-security-posture-management-dspm/).

[4] SMPAI: Secure Multi-Party Computation for Federated Learning, J.P. Morgan. Available at: [J.P. Morgan](https://www.jpmorgan.com/global/technology/smpai).

[5] Cloud-SMPC: Two-Round Multilinear Maps Secure Multiparty Computation, Journal of Cloud Computing. Available at: [SpringerOpen](https://journalofcloudcomputing.springeropen.com/articles/10.1186/s13677-020-00185-7).

[6] What is Secure Multiparty Computation? SMPC/MPC Explained, Inpher. Available at: [Inpher](https://www.inpher.io/what-is-secure-multiparty-computation).

[7] QuestionPro. "Data Governance Framework: A Complete Guide." QuestionPro.

[8] Atlan. "Unlocking Data Governance with Data Lineage." Atlan.

[9] Astera. "Data Governance in Financial Services: A Complete Analysis." Astera.

[10] Kumar, P. (2024). Generative Adversarial Networks (GANs): Creating Realistic Synthetic Data. Dataspace Insights. Available at: [Dataspace Insights](https://dataspaceinsights.com/generative-adversarial-networks-creating-realistic-synthetic-data).

[11] Figueira, A., & Vaz, B. (2022). Survey on Synthetic Data Generation, Evaluation Methods and GANs. Mathematics, 10(15), 2733.

[12] Figueira, A., & Vaz, B. (2023). Machine Learning for Synthetic Data Generation: A Review. arXiv.org. Available at: [arXiv](https://arxiv.org/abs/2302.04062).
