This project demonstrates practical data masking techniques in Python that support compliance with the GDPR (General Data Protection Regulation). It serves as a learning resource for Data Engineers and Developers who want to understand how to protect Personally Identifiable Information (PII) in datasets.
Under GDPR (specifically Article 32), organizations must implement appropriate technical and organizational measures to ensure a level of security appropriate to the risk. Key concepts include:
- Pseudonymization: Processing personal data such that it can no longer be attributed to a specific data subject without the use of additional information (which is kept separately).
- Data Minimization: Collecting and processing only the data that is necessary for the purpose.
- Integrity and Confidentiality: Ensuring data is protected against unauthorized access.
The included Python script (data_masking_demo.py) generates dummy PII data and applies several masking techniques to transform it into a privacy-preserving form that supports GDPR-aligned analysis (e.g., in a Data Warehouse or Analytics environment).
- Pseudonymization (Hashing):
  - Applied to: User ID
  - Method: SHA-256 hashing with salt.
  - Why: Allows tracking unique users without exposing their real IDs.
- Redaction:
  - Applied to: Email, Phone Number
  - Method: Partial masking (e.g., j*****[email protected]).
  - Why: Hides direct contact info while preserving the domain or format for validation.
- Generalization (Bucketing):
  - Applied to: Date of Birth -> Age Group
  - Method: Converting specific dates into ranges (e.g., 30-39).
  - Why: Reduces precision to prevent re-identification while keeping the data useful for demographic analysis.
- Suppression:
  - Applied to: IP Address
  - Method: Dropping the column entirely.
  - Why: If the data isn't needed for the specific analysis, remove it (Data Minimization).
- Perturbation:
  - Applied to: Salary
  - Method: Rounding values.
  - Why: Reduces precision of sensitive financial data.
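The five techniques above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the code in data_masking_demo.py: the function names, the SALT value, the record layout, the choice to keep the last four characters of a phone number, and rounding salaries to the nearest 1,000 are all hypothetical.

```python
import hashlib
from datetime import date

# Hypothetical salt for illustration only; not the value used by the script.
SALT = "demo-salt"


def pseudonymize(user_id: str, salt: str = SALT) -> str:
    """Pseudonymization: salted SHA-256 digest of the user ID."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()


def mask_email(email: str) -> str:
    """Redaction: keep the first character of the local part and the domain."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        # Not a recognizable address, so mask everything.
        return "*" * len(email)
    return local[0] + "*" * (len(local) - 1) + "@" + domain


def mask_phone(phone: str, visible: int = 4) -> str:
    """Redaction: replace everything except the last `visible` characters."""
    if len(phone) <= visible:
        return "*" * len(phone)
    return "*" * (len(phone) - visible) + phone[-visible:]


def age_group(dob: date, today: date) -> str:
    """Generalization: convert a date of birth into a 10-year age bucket."""
    # Subtract one year if the birthday has not occurred yet this year.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def round_salary(salary: int, step: int = 1000) -> int:
    """Perturbation: round to the nearest multiple of `step`."""
    return round(salary / step) * step


def mask_record(record: dict, today: date) -> dict:
    """Apply all techniques; IP Address is left out entirely (suppression)."""
    return {
        "User ID": pseudonymize(record["User ID"]),
        "Full Name": record["Full Name"],
        "Email": mask_email(record["Email"]),
        "Phone Number": mask_phone(record["Phone Number"]),
        "Age Group": age_group(record["Date of Birth"], today),
        "Salary": round_salary(record["Salary"]),
    }


# Made-up sample record, unrelated to the script's generated data.
sample = {
    "User ID": "user-001",
    "Full Name": "Jane Doe",
    "Email": "jane.doe@example.com",
    "Phone Number": "555-010-9999",
    "Date of Birth": date(1990, 5, 17),
    "IP Address": "192.0.2.10",
    "Salary": 38420,
}

masked = mask_record(sample, today=date(2024, 1, 1))
print(masked["Email"])         # j*******@example.com
print(masked["Phone Number"])  # ********9999
print(masked["Age Group"])     # 30-39
print(masked["Salary"])        # 38000
```

Passing `today` explicitly keeps the age bucket reproducible in tests. Note that Python's built-in `round` rounds exact halves to the nearest even number, so a different rounding rule may be preferable if that matters for the analysis.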
- Python 3.x
- pip (Python package manager)
- Clone this repository (or download the files).
- Create a virtual environment (recommended):

      # Windows
      python -m venv venv
      .\venv\Scripts\activate

      # macOS/Linux
      python3 -m venv venv
      source venv/bin/activate

- Install the required dependencies:

      pip install -r requirements.txt

Execute the Python script:

    python data_masking_demo.py

When you run the script, you will see a comparison of the original and masked data.
Original Data (Snippet):

| User ID | Full Name | Email | Phone Number | Date of Birth | IP Address | Salary |
| --- | --- | --- | --- | --- | --- | --- |
| bdd640fb-0667... | Daniel Doyle | [email protected] | 538.990.8386 | 1982-03-12 | 192.168.1.1 | 38420 |

Masked Data (Snippet):

| User ID | Full Name | Email | Phone Number | Age Group | Salary |
| --- | --- | --- | --- | --- | --- |
| 5ce9bbbe5a61... | Daniel Doyle | g**********[email protected] | *******8386 | 40-49 | 38000 |

Note: The IP Address column is removed, Date of Birth is replaced by Age Group, and User ID is hashed.
This code is for educational purposes only. In a production environment, manage secrets such as hashing salts and encryption keys securely (e.g., using a Key Management Service) and follow your organization's specific compliance requirements.
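As one possible direction for production use, the sketch below keeps the pseudonymization secret out of the source code by reading it from an environment variable and using it as an HMAC-SHA256 key. This swaps the script's salted SHA-256 for a keyed hash and is an assumption for illustration, not what data_masking_demo.py does. The variable name PSEUDONYMIZATION_KEY and both function names are hypothetical; in practice the value would typically be injected by a secrets manager or Key Management Service.

```python
import hashlib
import hmac
import os


def load_pseudonymization_key(env_var: str = "PSEUDONYMIZATION_KEY") -> bytes:
    """Read the secret key from the environment (hypothetical variable name)."""
    value = os.environ.get(env_var)
    if not value:
        # Fail loudly rather than fall back to a hard-coded key.
        raise RuntimeError(f"{env_var} is not set")
    return value.encode("utf-8")


def pseudonymize_keyed(user_id: str, key: bytes) -> str:
    """Keyed pseudonymization: HMAC-SHA256 of the user ID, hex-encoded."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
```

Calling `pseudonymize_keyed("user-001", load_pseudonymization_key())` returns a 64-character hex string that is stable for a given key, so joins across tables still work. Without the key, the mapping cannot be recomputed from known IDs, which is why the key has to be stored separately from the masked dataset.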