A production-quality, state-of-the-art implementation of BiLSTM-CRF for Named Entity Recognition (NER) in biomedical text. Built entirely from scratch using PyTorch, this project targets the BC5CDR dataset to recognize Chemical and Disease entities.
- From-Scratch Implementation: All components (CRF, attention, highway networks) implemented using PyTorch primitives
- State-of-the-Art Architecture: BiLSTM + CRF + Character CNN + Self-Attention + GloVe embeddings
- Production Quality: Comprehensive testing, error analysis, visualization, and ablation studies
- Expected Performance: ~87% F1 score on BC5CDR dataset
┌──────────────────────────────────────────────────┐
│            BiLSTM-CRF-Attention Model            │
├──────────────────────────────────────────────────┤
│                                                  │
│    ┌──────────────┐       ┌──────────────┐       │
│    │ Word Tokens  │       │  Characters  │       │
│    └──────┬───────┘       └──────┬───────┘       │
│           │                      │               │
│           ▼                      ▼               │
│    ┌──────────────┐   ┌──────────────────────┐   │
│    │    GloVe     │   │    Character CNN     │   │
│    │  Embeddings  │   │ (Multi-kernel 2,3,4) │   │
│    │    (100d)    │   │          ▼           │   │
│    └──────┬───────┘   │   Highway Networks   │   │
│           │           │          ▼           │   │
│           │           │ Char Features (50d)  │   │
│           │           └──────────┬───────────┘   │
│           │                      │               │
│           └──────────┬───────────┘               │
│                      │ Concatenate               │
│                      ▼                           │
│           ┌──────────────────────┐               │
│           │  Bidirectional LSTM  │               │
│           │   (2 layers, 256)    │               │
│           └──────────┬───────────┘               │
│                      │                           │
│                      ▼                           │
│           ┌──────────────────────┐               │
│           │    Self-Attention    │               │
│           │ (4 heads, LayerNorm) │               │
│           └──────────┬───────────┘               │
│                      │                           │
│                      ▼                           │
│           ┌──────────────────────┐               │
│           │     Linear Layer     │               │
│           │ (hidden → num_tags)  │               │
│           └──────────┬───────────┘               │
│                      │                           │
│                      ▼                           │
│           ┌──────────────────────┐               │
│           │      CRF Layer       │               │
│           │  (Viterbi Decoding)  │               │
│           └──────────┬───────────┘               │
│                      │                           │
│                      ▼                           │
│           ┌──────────────────────┐               │
│           │   BIO Tag Sequence   │               │
│           │  O B-Chem I-Chem O   │               │
│           └──────────────────────┘               │
└──────────────────────────────────────────────────┘
| Component | Implementation | Description |
|---|---|---|
| CRF Layer | src/models/layers.py | Forward-backward algorithm, Viterbi decoding |
| Character CNN | src/models/layers.py | Multi-kernel (2,3,4) with highway networks |
| Self-Attention | src/models/attention.py | Multi-head attention with layer normalization |
| Highway Networks | src/models/layers.py | Gradient-friendly deep networks |
| GloVe Loader | src/utils/embeddings.py | Pre-trained embedding integration |
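The Viterbi decoding listed for the CRF layer can be illustrated without PyTorch. The following is a minimal pure-Python sketch for a single non-empty sentence, not the code in src/models/layers.py: the function name `viterbi_decode`, the three-tag inventory, and the toy scores are assumptions made for illustration.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring tag sequence for one non-empty sentence.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    # score[j] = best score of any path that ends in tag j at the current step
    score = list(emissions[0])
    backpointers: List[List[int]] = []

    for t in range(1, len(emissions)):
        next_score = []
        pointers = []
        for j in range(num_tags):
            # Best previous tag for reaching tag j at step t
            best_prev = max(
                range(num_tags), key=lambda i: score[i] + transitions[i][j]
            )
            pointers.append(best_prev)
            next_score.append(
                score[best_prev] + transitions[best_prev][j] + emissions[t][j]
            )
        score = next_score
        backpointers.append(pointers)

    # Follow the back-pointers from the best final tag
    best_tag = max(range(num_tags), key=lambda j: score[j])
    path = [best_tag]
    for pointers in reversed(backpointers):
        best_tag = pointers[best_tag]
        path.append(best_tag)
    path.reverse()
    return path


# Toy example with tags 0 = O, 1 = B-Chemical, 2 = I-Chemical.
# A large negative score forbids the invalid transition O -> I-Chemical.
NEG = -1e4
transitions = [
    [0.0, 0.0, NEG],  # from O
    [0.0, 0.0, 0.0],  # from B-Chemical
    [0.0, 0.0, 0.0],  # from I-Chemical
]
emissions = [
    [1.0, 0.0, 0.0],  # position 0 prefers O
    [0.0, 0.4, 0.5],  # position 1 slightly prefers I-Chemical
]
path = viterbi_decode(emissions, transitions)
print(path)
```

In this toy case a per-position argmax would pick O followed by I-Chemical, which is not a valid BIO sequence. The transition penalty makes the decoder choose O followed by B-Chemical instead, which is the sense in which a CRF enforces valid tag sequences.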
- Learning rate warmup with linear decay
- Class weight balancing for imbalanced labels
- TensorBoard logging
- Comprehensive checkpointing with resume support
- Early stopping
- Detailed error analysis (boundary, type, missed, spurious)
- Per-entity-type metrics
- Performance by entity length
- Confusion matrix visualization
- HTML prediction visualization
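The warmup-then-linear-decay schedule named in the training features can be sketched as a multiplier on the base learning rate. This is an illustrative version only: the function name `lr_multiplier` and the choice to count optimizer steps (rather than epochs, as `warmup_epochs` in the config suggests) are assumptions, and the project's scheduler may differ.

```python
from typing import List


def lr_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 up to 1 during warmup
        return step / max(1, warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps
    remaining = max(0, total_steps - step)
    return remaining / max(1, total_steps - warmup_steps)


# Hypothetical run: 4 warmup steps out of 12 total steps
schedule: List[float] = [
    lr_multiplier(s, warmup_steps=4, total_steps=12) for s in range(13)
]
print(schedule)
```

A function of this shape could be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's base learning rate by the returned factor.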
.
├── config/
│   └── config.yaml # Complete configuration
├── data/
│   ├── raw/ # BC5CDR PubTator files
│   ├── processed/ # BIO-formatted data
│   └── embeddings/ # GloVe embeddings
├── src/
│   ├── data/
│   │   ├── bc5cdr_parser.py # BC5CDR dataset parser
│   │   ├── preprocess.py # Data preprocessing
│   │   └── dataset.py # PyTorch Dataset/DataLoader
│   ├── models/
│   │   ├── bilstm_crf.py # Main BiLSTM-CRF model
│   │   ├── layers.py # CRF, CharCNN, Highway
│   │   ├── attention.py # Self-attention mechanism
│   │   └── baseline_tagger.py # Baseline model
│   ├── training/
│   │   ├── train.py # Training pipeline
│   │   └── eval.py # Evaluation script
│   └── utils/
│       ├── vocab.py # Word, Label, Char vocabularies
│       ├── metrics.py # Entity-level metrics
│       ├── embeddings.py # GloVe loader
│       ├── analysis.py # Error analysis tools
│       ├── visualization.py # Plotting utilities
│       └── logging_utils.py # Logging utilities
├── scripts/
│   ├── ablation_study.py # Ablation experiments
│   ├── run_train.sh # Training script
│   └── run_eval.sh # Evaluation script
├── tests/
│   ├── test_crf.py # CRF unit tests
│   ├── test_model.py # Model unit tests
│   ├── test_vocab.py # Vocabulary tests
│   └── test_metrics.py # Metrics tests
├── artifacts/ # Saved models and vocabularies
├── reports/ # Outputs and visualizations
└── requirements.txt # Dependencies
- Python 3.8+
- PyTorch 1.9+
- CUDA (optional, for GPU training)
# Clone the repository
git clone <repository-url>
cd NLP_Project
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Download GloVe 100d embeddings
python -m src.utils.embeddings --output-dir data/embeddings --dim 100 --corpus 6B
# Process BC5CDR dataset (downloads if not present)
python -m src.data.preprocess
# Train with all features (char CNN, attention, GloVe)
python -m src.training.train --config config/config.yaml --model bilstm_crf
# Evaluate on test set
python -m src.training.eval --config config/config.yaml --model bilstm_crf
# Run command-line demo
python demo.py
# Or run web demo (requires: pip install gradio)
python app.py
# Then open http://localhost:7860 in your browser
# Compare different model configurations
python scripts/ablation_study.py --config config/config.yaml --output results/ablation
# Baseline BiLSTM (no CRF)
python -m src.training.train --model baseline_bilstm
# BiLSTM-CRF (default)
python -m src.training.train --model bilstm_crf
# Resume training from a checkpoint
python -m src.training.train --resume artifacts/best_model.pt
Key options in config/config.yaml:
model:
  use_char_features: true # Enable character CNN
  use_attention: true # Enable self-attention
  use_pretrained_embeddings: true # Use GloVe
training:
  use_warmup: true # Enable LR warmup
  warmup_epochs: 2
  use_class_weights: false # Balance imbalanced classes
| Model Configuration | Precision | Recall | F1 Score |
|---|---|---|---|
| BiLSTM only | ~82% | ~78% | ~80% |
| BiLSTM + CRF | ~85% | ~83% | ~84% |
| BiLSTM + CRF + CharCNN | ~87% | ~85% | ~86% |
| BiLSTM + CRF + Attention | ~86% | ~85% | ~85.5% |
| Full Model (no pretrained) | ~87% | ~86% | ~86.5% |
| Full Model + GloVe | ~88% | ~86% | ~87% |
- CRF Layer: +3-4% F1 (enforces valid tag sequences)
- Character CNN: +2% F1 (captures morphological patterns)
- Self-Attention: +1-2% F1 (models long-range dependencies)
- GloVe Embeddings: +0.5-1% F1 (better word representations)
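For reading the precision, recall, and F1 figures above, it helps to see how entity-level scoring works: a prediction counts only if both its span and its type match a gold entity exactly. The sketch below shows this for a single sentence. It is not the code in src/utils/metrics.py; the names `extract_spans` and `entity_prf`, and the strict handling of stray `I-` tags, are assumptions.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    etype: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_prf(true_tags: List[str], pred_tags: List[str]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over entity spans."""
    gold = extract_spans(true_tags)
    pred = extract_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


true_tags = ["B-Chemical", "O", "O", "B-Disease", "I-Disease", "O"]
pred_tags = ["B-Chemical", "O", "O", "B-Disease", "O", "O"]
print(entity_prf(true_tags, pred_tags))
```

Here the chemical is matched exactly, while the truncated disease span counts as both a false positive and a false negative, so all three scores come out at 0.5. A corpus-level score would sum the counts over all sentences before dividing.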
# Run all tests
python -m pytest tests/ -v
# Run specific test file
python -m pytest tests/test_crf.py -v
- CRF forward pass and Viterbi decoding
- Model initialization and forward pass
- Vocabulary encoding/decoding
- Metrics computation
from src.utils.analysis import analyze_errors, print_error_analysis
analysis = analyze_errors(tokens_list, true_tags, pred_tags)
print_error_analysis(analysis)
Output includes:
- Boundary errors (wrong entity span)
- Type errors (wrong entity type)
- Missed entities
- Spurious predictions
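One plausible way to assign errors to these four categories is to compare gold and predicted spans by overlap. The sketch below is an assumption about how such a categorization could work, not the logic of `analyze_errors`: the function names and the exact category definitions (given in the comments) are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def overlaps(a: Span, b: Span) -> bool:
    """True if the two token ranges share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def categorize_errors(gold: Set[Span], pred: Set[Span]) -> Dict[str, List[Span]]:
    errors: Dict[str, List[Span]] = {
        "boundary": [], "type": [], "missed": [], "spurious": [],
    }
    for g in sorted(gold - pred):
        partners = [p for p in pred if overlaps(g, p)]
        if not partners:
            errors["missed"].append(g)      # no prediction touches this entity
        elif any(p[:2] == g[:2] for p in partners):
            errors["type"].append(g)        # same span, different type
        else:
            errors["boundary"].append(g)    # overlapping but different span
    for p in sorted(pred - gold):
        if not any(overlaps(p, g) for g in gold):
            errors["spurious"].append(p)    # prediction with no gold overlap
    return errors


gold = {(0, 1, "Chemical"), (3, 5, "Disease"), (7, 8, "Disease")}
pred = {(0, 1, "Disease"), (3, 4, "Disease"), (10, 11, "Chemical")}
errors = categorize_errors(gold, pred)
print(errors)
```

Under these definitions each gold entity that was not predicted exactly falls into one of the first three categories, and predictions that overlap nothing in the gold annotation are reported as spurious.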
from src.utils.visualization import plot_training_curves, create_all_visualizations
# Plot training progress
plot_training_curves(history, 'training_curves.png')
# Generate all visualizations
create_all_visualizations(history, error_analysis, ...)
# Start TensorBoard
tensorboard --logdir artifacts/tensorboard
The project includes two demo interfaces for testing the trained model:
python demo.py
Features:
- Interactive text input
- Color-coded entity highlighting (blue for Chemical, red for Disease)
- Entity list with types
- Example sentences included
Example session:
=== BiLSTM-CRF NER Demo ===
Enter text: Aspirin can cause gastrointestinal bleeding.
Results:
[Aspirin] can cause [gastrointestinal bleeding].
Entities found:
- Aspirin (Chemical)
- gastrointestinal bleeding (Disease)
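The bracketed output in the session above can be produced from tokens and their predicted BIO tags. The following sketch is hypothetical and is not taken from demo.py: the function name `highlight` is an assumption, and it joins tokens with single spaces, so punctuation stays separated from the preceding word.

```python
from typing import List, Tuple


def highlight(tokens: List[str], tags: List[str]) -> Tuple[str, List[Tuple[str, str]]]:
    """Wrap each entity in brackets and list (text, type) pairs."""
    pieces: List[str] = []
    entities: List[Tuple[str, str]] = []
    current: List[str] = []   # tokens of the entity being built
    current_type = ""

    def close_entity() -> None:
        # Emit the entity collected so far, if any
        if current:
            text = " ".join(current)
            pieces.append(f"[{text}]")
            entities.append((text, current_type))
            current.clear()

    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)  # continue the open entity
            continue
        close_entity()
        if tag.startswith("B-"):
            current.append(token)  # start a new entity
            current_type = tag[2:]
        else:
            pieces.append(token)   # plain token (or a stray I- tag)
    close_entity()
    return " ".join(pieces), entities


tokens = ["Aspirin", "can", "cause", "gastrointestinal", "bleeding", "."]
tags = ["B-Chemical", "O", "O", "B-Disease", "I-Disease", "O"]
text, entities = highlight(tokens, tags)
print(text)
print(entities)
```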
Single-text mode:
python demo.py --text "Metformin treats diabetes"
# Install Gradio first
pip install gradio
# Run web demo
python app.py
Then open http://localhost:7860 in your browser.
Features:
- Modern web interface
- Visual entity highlighting with colors
- Clickable example sentences
- Entity table output
from src.models.bilstm_crf import BiLSTMCRF
model = BiLSTMCRF(
    vocab_size=10000,
    num_tags=6,
    embedding_dim=100,
    hidden_size=256,
    num_layers=2,
    dropout=0.5,
    pad_idx=0,
    pretrained_embeddings=None, # Optional GloVe tensor
    freeze_embeddings=False,
    use_char_features=True,
    num_chars=100,
    char_embedding_dim=30,
    char_hidden_size=50,
    char_kernel_sizes=[2, 3, 4],
    use_highway=True,
    use_attention=True,
    attention_heads=4,
    attention_dropout=0.1
)
# Training
loss = model.loss(token_ids, label_ids, mask, char_ids)
# Inference
predictions = model.predict(token_ids, mask, char_ids)
from src.models.layers import CRF
crf = CRF(num_tags=6, pad_idx=0)
# Compute negative log-likelihood
loss = crf(emissions, tags, mask)
# Viterbi decoding
best_paths = crf.decode(emissions, mask)
- Neural Architectures for Named Entity Recognition, Lample et al. (2016) - BiLSTM-CRF architecture
- End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, Ma & Hovy (2016) - Character CNN + Highway networks
- Attention Is All You Need, Vaswani et al. (2017) - Multi-head attention
- BioCreative V CDR Task Corpus, Li et al. (2016) - BC5CDR dataset
# Reduce batch size in config.yaml
training:
  batch_size: 16 # Default: 32
# Disable optional features for faster training
model:
  use_char_features: false
  use_attention: false
- Ensure GloVe embeddings are loaded correctly
- Try increasing num_epochs to 50
- Use learning rate warmup
- Check data preprocessing for errors
This is an educational project for NLP coursework. Contributions welcome via:
- Bug reports and feature requests
- Code improvements and optimizations
- Documentation enhancements
- Additional test cases
This project is for educational purposes as part of an NLP course project.
- BC5CDR dataset creators at NCBI
- PyTorch team for the deep learning framework
- GloVe team at Stanford NLP
- Course instructors and teaching materials
Author: NLP Course Project
Built From Scratch: All neural network components implemented using PyTorch primitives