A metadata-only course recommendation engine built on the Coursera Course Dataset 2023 — no user interaction history required.
With thousands of online courses on Coursera, learners struggle to find courses that match their skill goals, background, and preferences. This project solves that by building a Hybrid Course Recommender System that operates entirely on course-level metadata — no historical user interaction data needed.
The system combines three techniques in a pipeline:
- Content-Based Filtering (CBF) — TF-IDF vectorization of course text with cosine similarity scoring against the user's skill-interest query.
- Knowledge-Based Filtering (KBF) — Hard constraint satisfaction on user-specified requirements (difficulty, certificate type, organization).
- Penalty-Based Re-Ranking (novel contribution) — A multiplicative penalty function that demotes courses with poor ratings or very low enrollment.
Every recommendation is fully explainable: CBF score, hybrid score, penalty factor, and penalty reasons are all shown to the user.
| Property | Value |
|---|---|
| Name | Coursera Course Dataset 2023 |
| Source | Kaggle — tianyimasf/coursera-course-dataset |
| Records | 993 courses |
| Key Fields | course_title, course_description, course_skills, course_difficulty, course_rating, course_students_enrolled, course_certificate_type, course_organization |
| User Data | None (metadata-only) |
Each course is represented as a TF-IDF vector built from its title, description, and skill tags. The vectorizer uses:
- Unigrams and bigrams (`ngram_range=(1,2)`)
- Sublinear TF scaling (`log(1+tf)`)
- Top 10,000 features with English stop-word removal
A user query (e.g., "machine learning, Python, data analysis") is vectorized in the same space, and cosine similarity produces a relevance score in [0, 1] for each course.
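The scoring step described above can be sketched with the standard library alone. This is an illustrative approximation, not the project's code: the tiny stop-word set, the smoothed IDF formula, the sample course texts, and the helper names (`ngrams`, `fit_idf`, `vectorize`, `cosine`) are all assumptions, and the 10,000-feature cap is omitted. A library vectorizer such as scikit-learn's `TfidfVectorizer` (with `ngram_range=(1, 2)`, `sublinear_tf=True`, `max_features=10000`, `stop_words="english"`) would be the more likely choice in practice; note that its sublinear scaling is `1 + log(tf)` rather than `log(1+tf)`.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word subset (assumption; a real list is much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "with", "for", "in"}

def ngrams(text):
    """Lowercase, drop stop words, and return unigrams plus bigrams."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]
    return words + [f"{a} {b}" for a, b in zip(words, words[1:])]

def fit_idf(docs):
    """Smoothed inverse document frequency per term (formula is an assumption)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(ngrams(doc)))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}

def vectorize(text, idf):
    """Sparse TF-IDF vector with sublinear TF, log(1 + tf); unknown terms are ignored."""
    tf = Counter(t for t in ngrams(text) if t in idf)
    return {t: math.log(1 + c) * idf[t] for t, c in tf.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Made-up course texts standing in for title + description + skill tags.
courses = [
    "Machine Learning with Python: regression, classification, data analysis",
    "Financial Accounting Fundamentals: balance sheets, income statements",
    "Data Analysis with Excel: pivot tables, charts",
]
idf = fit_idf(courses)
course_vecs = [vectorize(c, idf) for c in courses]
query_vec = vectorize("machine learning, Python, data analysis", idf)
scores = [cosine(query_vec, v) for v in course_vecs]
```

Because all TF-IDF weights are non-negative, every score falls in [0, 1]; here the machine learning course ranks first and the accounting course, which shares no terms with the query, scores 0.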
Structured Token Enrichment: The system automatically appends structured tokens (e.g., skill_python, level_beginner) to the query to prioritize categorical matches over fuzzy text.
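A minimal sketch of what this enrichment might look like follows. Only the `skill_python` and `level_beginner` token shapes come from the text above; the function name `enrich_query`, the comma-splitting rule, and the handling of multi-word skills are assumptions made for illustration.

```python
def enrich_query(query, difficulty=None):
    """Append structured tokens to a free-text query (hypothetical helper)."""
    # Treat each comma-separated phrase as one skill (assumption).
    skills = [s.strip().lower() for s in query.split(",") if s.strip()]
    tokens = [f"skill_{s.replace(' ', '_')}" for s in skills]
    if difficulty:
        tokens.append(f"level_{difficulty.lower()}")
    return " ".join([query] + tokens)

enriched = enrich_query("Python, machine learning", "Beginner")
```

For this to influence ranking, the same structured tokens would also need to appear in each course's indexed text, so that they match as exact terms in the TF-IDF space.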
Users specify hard constraints at query time:
- Difficulty level (Beginner / Intermediate / Advanced / Mixed)
- Certificate type
- Preferred organization (optional)
Courses failing any constraint are excluded entirely before scoring — fully rule-based and explainable.
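The pre-filter can be illustrated roughly as below. The field names match the dataset's key fields; the `Constraints` class, the `passes` and `kbf_filter` helpers, the case-insensitive exact match, and the sample rows are assumptions for the sake of the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Constraints:
    """User-specified hard constraints; None means 'no preference'."""
    difficulty: Optional[str] = None
    certificate_type: Optional[str] = None
    organization: Optional[str] = None

def passes(course: dict, c: Constraints) -> bool:
    """True only if the course satisfies every constraint the user set."""
    checks = [
        ("course_difficulty", c.difficulty),
        ("course_certificate_type", c.certificate_type),
        ("course_organization", c.organization),
    ]
    return all(
        want is None or str(course.get(field, "")).lower() == want.lower()
        for field, want in checks
    )

def kbf_filter(courses, c):
    """Drop every course that fails any constraint, before any scoring happens."""
    return [course for course in courses if passes(course, c)]

# Made-up rows for illustration only.
catalog = [
    {"course_title": "A", "course_difficulty": "Beginner", "course_certificate_type": "Course", "course_organization": "Org X"},
    {"course_title": "B", "course_difficulty": "Advanced", "course_certificate_type": "Course", "course_organization": "Org X"},
    {"course_title": "C", "course_difficulty": "Beginner", "course_certificate_type": "Specialization", "course_organization": "Org Y"},
]
kept = kbf_filter(catalog, Constraints(difficulty="beginner", certificate_type="Course"))
```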
KBF acts as a hard pre-filter. The weights used to fuse the remaining signals into the hybrid score then adjust dynamically with user profile maturity:
- Cold Start: Prioritizes CBF and Popularity signals.
- Active User: Gradually shifts weight toward Collaborative Filtering (up to 50%) as the user provides more star ratings.
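One way such a schedule could work is sketched here. Only the 50% cap on the collaborative weight comes from the description above; the linear ramp, the `saturation` count of 20 ratings, the 80/20 split between CBF and popularity, and the function name `hybrid_weights` are hypothetical choices.

```python
def hybrid_weights(n_ratings, max_cf=0.5, saturation=20, popularity_share=0.2):
    """Return signal weights that sum to 1 (illustrative schedule, not the project's)."""
    # Ramp the CF weight linearly until the user has `saturation` ratings (assumption).
    progress = min(max(n_ratings, 0), saturation) / saturation
    cf = max_cf * progress
    rest = 1.0 - cf
    return {
        "cbf": rest * (1 - popularity_share),
        "popularity": rest * popularity_share,
        "cf": cf,
    }

cold_start = hybrid_weights(0)     # no ratings yet: CF weight is 0
active_user = hybrid_weights(100)  # many ratings: CF weight is capped at 0.5
```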
A multiplicative penalty is applied after hybrid scoring to prevent low-quality but highly similar content from ranking first. Each penalty is shown to the user with a plain-language explanation (e.g., "Rating (3.1) is below quality threshold (3.5)").
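A rough sketch of such a penalty function follows. The 3.5 rating threshold is taken from the example message above; the enrollment threshold of 1,000, the factors 0.8 and 0.9, the wording of the enrollment reason, and the name `apply_penalty` are placeholders chosen for illustration.

```python
def apply_penalty(hybrid_score, rating, enrolled,
                  min_rating=3.5, min_enrolled=1000,
                  rating_factor=0.8, enrollment_factor=0.9):
    """Return (penalized score, penalty factor, reasons). Thresholds are illustrative."""
    factor = 1.0
    reasons = []
    if rating < min_rating:
        factor *= rating_factor
        reasons.append(f"Rating ({rating}) is below quality threshold ({min_rating})")
    if enrolled < min_enrolled:
        factor *= enrollment_factor
        reasons.append(f"Enrollment ({enrolled}) is below minimum ({min_enrolled})")
    return hybrid_score * factor, factor, reasons

score, factor, reasons = apply_penalty(0.9, rating=3.1, enrolled=50000)
```

Because the factors multiply, a course that fails both checks is demoted more than one that fails a single check, while a course that passes both keeps its hybrid score unchanged.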
When logged in, users can star/rate courses (1–5 stars). A tunable CF weight slider (0–0.5) controls how strongly community ratings influence the final ranking.
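The slider's effect could be modeled as a simple linear blend, sketched below. Treating the community signal as a mean star rating rescaled to [0, 1], falling back to the hybrid score when a course has no ratings, and the name `blend_with_cf` are assumptions; only the 1 to 5 star scale and the 0 to 0.5 slider range come from the text.

```python
def blend_with_cf(hybrid_score, mean_stars, cf_weight):
    """Blend the hybrid score with a community-rating score (illustrative)."""
    # Clamp the slider value to the documented 0-0.5 range.
    w = min(max(cf_weight, 0.0), 0.5)
    if mean_stars is None:
        # No community ratings for this course: keep the hybrid score as-is (assumption).
        return hybrid_score
    cf_score = (mean_stars - 1) / 4  # map 1..5 stars onto 0..1
    return (1 - w) * hybrid_score + w * cf_score

final = blend_with_cf(0.6, mean_stars=5, cf_weight=0.5)
```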
Install the dependencies with `pip install -r requirements.txt`, then start the app with `python app.py`.
Then open your browser at http://localhost:5000.
- Enter a skill-interest query (e.g., "machine learning python")
- Optionally set hard constraints: difficulty level, certificate type, organization
- Browse results — each card shows CBF score, hybrid score, penalty factor, and penalty reason
├── app.py # Main application entry point
├── recommender/
│ ├── cbf.py # Content-Based Filtering (TF-IDF + cosine similarity)
│ ├── kbf.py # Knowledge-Based Filtering (constraint satisfaction)
│ ├── hybrid.py # Hybrid score fusion
│ ├── penalty.py # Penalty-based re-ranking
│ └── collaborative.py # Collaborative filtering layer
├── data/
│ └── coursera_courses.csv
├── evaluation/
│ └── metrics.py # Precision, Recall, NDCG, ILD, Coverage
├── requirements.txt
└── README.md
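
Two of the metrics listed for `evaluation/metrics.py` can be sketched as follows, assuming binary relevance labels. These are the textbook definitions of Precision@k and NDCG@k, not necessarily how the project computes them; the function names and sample data are illustrative.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for c in ranked[:k] if c in relevant) / k

def ndcg_at_k(ranked, relevant, k):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(1.0 / math.log2(i + 2) for i, c in enumerate(ranked[:k]) if c in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

# Made-up ranking and relevance labels.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
p3 = precision_at_k(ranked, relevant, 3)
n3 = ndcg_at_k(ranked, relevant, 3)
```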