
AI Deployments and Personal Data – The Key Role of Document Anonymization

More and more organizations want to use AI and LLMs to work with documents, but personal data remains a major legal and operational barrier. We explain why document anonymization is a key element of secure AI deployments and how to prepare data in line with GDPR and privacy by design principles.


Large language models (LLMs) are increasingly becoming a priority for organizations – as tools for document analysis, building internal knowledge bases, supporting teams’ daily work, and improving decision-making processes. In practice, almost every such initiative quickly encounters a fundamental problem: documents usually contain personal data, and personal data should not be processed directly by language models.

The scale of this challenge is confirmed by a report commissioned by the President of Urząd Ochrony Danych Osobowych (UODO, the Polish data protection authority) titled “Study on Organizations’ Needs in the Use of Artificial Intelligence”, which shows that:

• 41% of organizations do not see or cannot assess the link between AI development and personal data processing,

• among organizations already using AI, this figure rises to 58.5%,

• as many as 95.9% of respondents consider themselves unprepared or uncertain in applying GDPR in the context of AI.

The report’s conclusions clearly indicate the existence of a systemic gap in the use of personal data in AI. This is not primarily a technological issue, but rather an organizational, procedural, and competency challenge. Moreover, ignoring this problem can lead to significantly higher AI implementation costs as well as serious legal consequences.

The UODO report highlights that regulatory uncertainty and legal responsibility – not technology itself – are among the main barriers to AI adoption in organizations.

GDPR perspective: purpose limitation and AI

To fully understand the problem, it is necessary to look at it through the lens of GDPR, where the principle of purpose limitation plays a key role. Personal data may only be processed for the purpose for which it was originally collected. Using HR documents, contracts, official correspondence, or internal communications for:

• training AI models,

• proof-of-concept testing,

• building RAG systems,

• internal LLM-based assistants,

very rarely fits within the original legal basis for processing that data.

Once personal data enters an AI model, removing it is effectively impossible in practice. This creates serious legal risk and, in many cases, has already led to AI projects being abandoned after significant budgets were spent with no usable outcome.

Anonymization as an accelerator of AI projects

One of the key findings of the report is the demand for practical, operational tools – rather than additional abstract guidelines or purely formal compliance efforts.

Respondents rated the following most highly:

• checklists,

• model DPIAs,

• decision maps,

• repositories of risks and best practices.

In this context, automated document anonymization – as a step preceding the use of documents in AI systems – perfectly addresses organizational needs and is one of the fundamental requirements for secure AI deployment.

Anonymization:

• reduces regulatory risk,

• simplifies DPIA processes,

• clarifies responsibilities,

• allows faster transition from idea to implementation.

Counterintuitively, anonymization not only increases safety but also significantly accelerates AI adoption.

Local document anonymization as part of privacy by design

In the qualitative part of the report, both private companies and public institutions expressed concerns related to:

• data security,

• trust in large AI providers,

• transferring documents to cloud environments.

These concerns are justified, which is why demand is growing for local (on-premise) solutions that never send data outside the organization’s infrastructure and thereby strengthen its security posture. This approach directly supports the principles of privacy by design and privacy by default. When anonymization is full and irreversible, the resulting documents may even fall outside the scope of GDPR, significantly simplifying AI adoption.

The optimal pipeline: Document → Anonymization → LLM

Based on the report’s recommendations, real-world implementation experience, and market trends, a simple and secure standard can now be defined:

  1. Input: PDF, DOCX, or JPG documents, scans, etc.
  2. OCR + NLP (adapted to the local language and context): text extraction and detection of personal data with linguistic and contextual understanding.
  3. Transformation: anonymization or pseudonymization according to business rules.
  4. Audit: detection reports, operation logs, and data for the DPIA and records of processing activities.
  5. AI usage: indexing, embeddings, RAG systems, or LLM analysis.

This is exactly the type of “practical architecture” organizations expect, as highlighted in the report.
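To make the pipeline concrete, here is a minimal Python sketch of the transformation and audit steps. Everything in it is a simplifying assumption: the regex-based detectors, the `pseudonymize` function, and the token format are hypothetical placeholders for a real OCR + NER stack with language-aware detection, not a description of any particular product.

```python
import hashlib
import re

# Hypothetical, minimal sketch of the "Transformation" and "Audit" pipeline
# steps. A production system would use OCR plus language-aware NER models,
# not bare regexes, to detect personal data in context.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PESEL": re.compile(r"\b\d{11}\b"),  # Polish national ID number
}


def pseudonymize(text: str) -> tuple[str, list[dict]]:
    """Replace detected PII with stable tokens and return an audit trail.

    The token embeds a truncated hash of the original value, so the same
    value always maps to the same token (useful for RAG consistency),
    while the audit entries feed detection reports and DPIA records.
    """
    audit = []
    for label, pattern in PII_PATTERNS.items():
        def repl(match, label=label):
            digest = hashlib.sha256(match.group().encode()).hexdigest()[:8]
            token = f"[{label}_{digest}]"
            audit.append({"type": label, "token": token})
            return token
        text = pattern.sub(repl, text)
    return text, audit


if __name__ == "__main__":
    doc = "Contact the employee at jan.kowalski@example.com, PESEL 44051401359."
    clean, log = pseudonymize(doc)
    print(clean)  # PII replaced by stable tokens, safe to index or embed
    print(log)    # audit entries for the DPIA / records of processing
```

Only the pseudonymized text would then move on to indexing, embeddings, or LLM analysis; the audit log stays inside the organization’s infrastructure, in line with the on-premise approach described above.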

What AI and compliance teams really need today

The UODO report identifies three priority areas:

  1. Technical aspects of GDPR compliance in AI – tools that help recognize when AI involves personal data processing.
  2. Data quality and model performance – preprocessing, leakage control, and documentation of limitations.
  3. The relationship between GDPR and AI Act – a clear separation of data governance and AI governance.

Automated document anonymization is one of the few elements that effectively connects all three areas.

Conclusions: anonymization as the foundation of secure AI

The report commissioned by the President of UODO clearly shows that AI is becoming increasingly widespread, but awareness of responsibilities related to personal data is essential for responsible and effective deployment. Organizations need simple, repeatable, and auditable mechanisms for risk reduction.

Anonymizing documents before using them in LLM systems is no longer an optional enhancement – it is a fundamental requirement for responsible and secure innovation.

The local document anonymization and personal data detection technology developed by Mycroft Solutions has been built from the very beginning with a secure-by-design approach.

Our solutions enable organizations to safely prepare documents for AI and LLM workflows – in full compliance with GDPR, without cloud processing, with complete data control, and with deep understanding of the Polish language and the ability to adapt to other languages and contexts.

Source

Report “Study on Organizations’ Needs in the Use of Artificial Intelligence”, President of UODO, 2025/2026 (CC BY 4.0). The report is exploratory and aims to identify gaps and organizational needs.

Related resources on document anonymization and AI

Also explore our materials on document anonymization, personal data protection, and the secure use of AI:

👉 Anonymization and pseudonymization of medical documentation

👉 Why most anonymization tools fail to meet user needs

👉 Mycroft Sweeper – a local document anonymization application

© Mycroft Solutions Sp. z o.o.