tmmsunny012/data_masking_for_GDPR_data_engineering

Data Masking for GDPR Policy Alignment

This project demonstrates practical data masking techniques in Python that help align data pipelines with GDPR (General Data Protection Regulation) policies. It is a learning resource for Data Engineers and Developers who want to understand how to protect Personally Identifiable Information (PII) in datasets.

🛡️ GDPR Context

Under GDPR, organizations must implement appropriate technical and organizational measures to ensure a level of security appropriate to the risk (Article 32, which names pseudonymization and encryption as examples of such measures). Key concepts include:

  • Pseudonymization (Article 4(5)): Processing personal data such that it can no longer be attributed to a specific data subject without the use of additional information, which is kept separately.
  • Data Minimization (Article 5(1)(c)): Collecting and processing only the data that is necessary for the purpose.
  • Integrity and Confidentiality (Article 5(1)(f)): Ensuring data is protected against unauthorized access.

🚀 Project Overview

The included Python script (data_masking_demo.py) generates dummy PII data and applies several masking techniques to transform it into a masked form better suited to GDPR-aligned analysis (e.g., in a Data Warehouse or Analytics environment).

Techniques Demonstrated

  1. Pseudonymization (Hashing):

    • Applied to: User ID
    • Method: SHA-256 Hashing with salt.
    • Why: Allows tracking unique users without exposing their real IDs.
  2. Redaction:

    • Applied to: Email, Phone Number
    • Method: Partial masking with asterisks (e.g., an email local part shown as j***** followed by its domain).
    • Why: Hides direct contact info while preserving domain or format for validation.
  3. Generalization (Bucketing):

    • Applied to: Date of Birth -> Age Group
    • Method: Converting specific dates into ranges (e.g., 30-39).
    • Why: Reduces precision to prevent re-identification while keeping the data useful for demographic analysis.
  4. Suppression:

    • Applied to: IP Address
    • Method: Dropping the column entirely.
    • Why: If the data isn't needed for the specific analysis, remove it (Data Minimization).
  5. Perturbation:

    • Applied to: Salary
    • Method: Rounding values (e.g., to the nearest 1,000, so 38420 becomes 38000).
    • Why: Reduces precision of sensitive financial data.
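
The five techniques above can be sketched with the Python standard library alone. This is a minimal illustration, not the contents of data_masking_demo.py: the function names, the SALT constant, and the sample record (including its email address) are assumptions made for the example, and the column names simply follow the example output in this README.

```python
import hashlib
from datetime import date

# Hypothetical salt for illustration only; do not hardcode a real one.
SALT = "demo-salt"


def pseudonymize(value: str, salt: str = SALT) -> str:
    """Pseudonymization: SHA-256 hex digest of salt + value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def redact_email(email: str) -> str:
    """Redaction: keep the first character of the local part and the domain."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return "*" * len(email)  # not a recognizable address: mask everything
    return local[0] + "*" * (len(local) - 1) + "@" + domain


def redact_phone(phone: str, visible: int = 4) -> str:
    """Redaction: replace all but the last `visible` characters with '*'."""
    if len(phone) <= visible:
        return "*" * len(phone)
    return "*" * (len(phone) - visible) + phone[-visible:]


def age_group(dob: date, today: date) -> str:
    """Generalization: map a date of birth to a 10-year bucket."""
    # Subtract one year if this year's birthday has not happened yet.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def round_salary(salary: int, step: int = 1000) -> int:
    """Rounding: nearest multiple of `step` (ties follow Python's round())."""
    return int(round(salary / step) * step)


def mask_record(record: dict, today: date) -> dict:
    """Apply all techniques to one record; IP Address is not copied (suppression)."""
    return {
        "User ID": pseudonymize(record["User ID"]),
        "Full Name": record["Full Name"],
        "Email": redact_email(record["Email"]),
        "Phone Number": redact_phone(record["Phone Number"]),
        "Age Group": age_group(record["Date of Birth"], today),
        "Salary": round_salary(record["Salary"]),
    }


# Made-up sample record for demonstration.
sample = {
    "User ID": "user-001",
    "Full Name": "Jane Doe",
    "Email": "jane.doe@example.com",
    "Phone Number": "538.990.8386",
    "Date of Birth": date(1982, 3, 12),
    "IP Address": "192.168.1.1",
    "Salary": 38420,
}

masked = mask_record(sample, today=date(2025, 1, 1))
print(masked)
```

Passing today explicitly keeps the age bucket reproducible in tests. Note that a salted hash is deterministic, so anyone who knows the salt and can enumerate candidate IDs can reverse the mapping; the salt therefore needs the same protection as any other secret.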

🛠️ Setup and Usage

Prerequisites

  • Python 3.x
  • pip (Python package manager)

Installation

  1. Clone this repository (or download the files).

  2. Create a virtual environment (recommended):

    # Windows
    python -m venv venv
    .\venv\Scripts\activate
    
    # macOS/Linux
    python3 -m venv venv
    source venv/bin/activate
  3. Install the required dependencies:

    pip install -r requirements.txt

Running the Demo

Execute the Python script:

python data_masking_demo.py

📊 Example Output

When you run the script, you will see a comparison of the Original and Masked data.

Original Data (Snippet):

User ID                               Full Name           Email                   Phone Number    Date of Birth   IP Address       Salary
bdd640fb-0667...                      Daniel Doyle        [email protected] 538.990.8386   1982-03-12      192.168.1.1      38420

Masked Data (Snippet):

User ID                               Full Name           Email                   Phone Number    Age Group       Salary
5ce9bbbe5a61...                       Daniel Doyle        g**********[email protected] *******8386    40-49           38000

Note: The IP Address column is removed, Date of Birth is replaced by Age Group, User ID is hashed, Email and Phone Number are partially masked, and Salary is rounded.

⚠️ Disclaimer

This code is for educational purposes. In a production environment, manage secrets such as hashing salts and encryption keys securely (e.g., using a Key Management Service) and follow your organization's specific compliance requirements.
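
As a small step toward that advice, the secret used for hashing can at least be kept out of the source code. The sketch below is an assumption-laden illustration, not part of the project: the MASKING_SALT variable name and both function names are hypothetical, and it swaps plain salted SHA-256 for HMAC-SHA256, which is the standard way to key a hash. A real deployment would typically fetch the secret from a Key Management Service or secret store rather than a plain environment variable.

```python
import hashlib
import hmac
import os


def load_salt(env_var: str = "MASKING_SALT") -> bytes:
    """Read the secret from the environment and fail loudly if it is missing."""
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"Set {env_var} before running the masking job")
    return value.encode("utf-8")


def pseudonymize(user_id: str, key: bytes) -> str:
    """Keyed pseudonym: HMAC-SHA256 of the user ID, as a hex string."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
```

With the variable set, pseudonymize("user-001", load_salt()) returns the same 64-character hex string on every run, so joins across tables still work, while the key itself stays outside the repository.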
