UCI Search Engine

A comprehensive search engine implementation with web crawling, indexing, and retrieval capabilities for UCI domain content.

Overview

This project implements a full-stack search engine solution that includes:

Web Crawler: Enhanced crawler for UCI domains with Python integration
Document Indexing: Efficient inverted index construction with TF-IDF scoring
Search Interface: FastAPI-based REST API with React frontend
Multi-language Support: Python (primary) and Rust (experimental) implementations

Project Structure

UCI-Search-Engine/
├── python/                    # Python implementation (primary)
│   ├── src/                  # Source code
│   │   ├── main.py          # Index builder
│   │   ├── indexer.py       # Inverted index implementation
│   │   ├── tokenizer.py     # Text tokenization and stemming
│   │   ├── query.py         # Search and retrieval
│   │   ├── similarity.py    # Duplicate detection
│   │   └── run_queries.py   # API server
│   └── requirements.txt     # Python dependencies
├── client/                   # React frontend
│   ├── src/                 # Frontend source
│   └── package.json         # Node dependencies
├── rust/                    # Rust implementation (experimental)
│   ├── src/                 # Rust source code
│   └── Cargo.toml          # Rust dependencies
└── UCI-Web-Crawler-master/  # Enhanced web crawler
    ├── enhanced_launch.py   # Enhanced crawler launcher
    ├── python_connector.py  # Python integration
    └── ENHANCED_README.md   # Crawler documentation

Installation & Setup

Prerequisites

Python 3.8+
Node.js 16+ (for frontend)
Rust 1.70+ (optional, for Rust implementation)

Python Backend Setup

Install Python dependencies:

cd python
pip install -r requirements.txt

Download NLTK data (first time only) by running the following in a Python interpreter:

import nltk
nltk.download('punkt')
nltk.download('stopwords')

Frontend Setup

Install Node.js dependencies:
```
cd client
npm install
```

Rust Setup (Optional)

Build Rust implementation:
```
cd rust
cargo build --release
```

Usage Guide

Building the Search Index

Step 1: Create Partial Indexes

cd python
python src/main.py

This processes documents in batches and creates partial indexes in the indexes/ directory.
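
The batching step can be sketched as follows. This is only an illustration under stated assumptions: documents arrive as `(doc_id, tokens)` pairs, and partial indexes are JSON files named `partial_N.json`. The function name, file layout, and the tiny batch size are hypothetical and not taken from `main.py`.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def build_partial_indexes(docs, out_dir, batch_size=2):
    """Write one partial inverted index per batch of documents.

    docs: iterable of (doc_id, list_of_tokens) pairs (assumed input shape).
    Returns the list of partial index file paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    index = defaultdict(dict)  # term -> {doc_id: term frequency}
    in_batch = 0

    def flush():
        nonlocal index, in_batch
        if index:
            path = out_dir / f"partial_{len(paths)}.json"
            # Sort terms so a later merge step can stream the files in order.
            path.write_text(json.dumps(dict(sorted(index.items()))))
            paths.append(path)
        index = defaultdict(dict)
        in_batch = 0

    for doc_id, tokens in docs:
        for token in tokens:
            postings = index[token]
            postings[doc_id] = postings.get(doc_id, 0) + 1
        in_batch += 1
        if in_batch >= batch_size:
            flush()
    flush()  # write whatever is left in the final, smaller batch
    return paths


docs = [
    ("d1", ["search", "engine"]),
    ("d2", ["search", "index"]),
    ("d3", ["index", "merge", "index"]),
]
partial_paths = build_partial_indexes(docs, tempfile.mkdtemp(), batch_size=2)
```

Only one batch of postings is held in memory at a time, which is the point of building partial indexes before merging.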

Step 2: Merge Partial Indexes

python src/lazy_merger.py

This combines all partial indexes into merged indexes organized by letter.
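
A rough sketch of the merge idea, assuming each partial index can be read as a stream of `(term, postings)` pairs sorted by term, and that "organized by letter" means grouping by the first character of each term. Both are assumptions, and `merge_partials` and `bucket_by_letter` are hypothetical names rather than functions from `lazy_merger.py`.

```python
import heapq
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def merge_partials(partials):
    """Lazily merge sorted (term, postings) streams into one sorted stream.

    heapq.merge pulls one entry at a time from each stream, so the
    partial indexes never need to be fully loaded together.
    """
    merged = heapq.merge(*partials, key=itemgetter(0))
    for term, group in groupby(merged, key=itemgetter(0)):
        combined = {}
        for _, postings in group:
            for doc_id, tf in postings.items():
                combined[doc_id] = combined.get(doc_id, 0) + tf
        yield term, combined


def bucket_by_letter(merged):
    """Group merged terms by first character; non-letters share one bucket."""
    buckets = defaultdict(dict)
    for term, postings in merged:
        key = term[0] if term[0].isalpha() else "_"
        buckets[key][term] = postings
    return dict(buckets)


partial_a = [("engine", {"d1": 1}), ("search", {"d1": 1, "d2": 1})]
partial_b = [("index", {"d3": 2}), ("search", {"d4": 3})]
letter_buckets = bucket_by_letter(merge_partials([iter(partial_a), iter(partial_b)]))
```

For brevity the buckets are collected in memory here. A memory-efficient version would write each letter's bucket to disk as soon as the merged stream moves on to the next letter.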

Running the Search Engine

Option 1: API Server with Web Interface

cd python/src
python run_queries.py

Then open your browser to http://localhost:3000 for the web interface.

Option 2: Frontend Development Server

# Terminal 1: Start API server
cd python/src
python run_queries.py

# Terminal 2: Start frontend dev server
cd client
npm run dev

Frontend will be available at http://localhost:5173.
The API server from Terminal 1 must stay running while the frontend dev server is in use.

Web Crawling (Optional)

To crawl new content and integrate with the search engine:

cd UCI-Web-Crawler-master

# Basic crawling
python enhanced_launch.py

# With Python search integration
python enhanced_launch.py --python_integration

# Test integration
python integration_demo.py

Features

Search Capabilities

Boolean AND queries: All terms must be present in results
TF-IDF scoring: Documents ranked by relevance
Stemming: Handles word variations (running → run)
Stop word filtering: Removes common words for better relevance
Fast retrieval: Optimized inverted index structure
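
The capabilities above can be illustrated with a small, self-contained sketch. It assumes a log-weighted TF-IDF variant, `(1 + ln tf) * ln(N / df)`, and a tiny hard-coded stop word list; the project's actual weighting, tokenizer, and stemmer may differ, and stemming is left out here. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "of", "and", "in"}  # tiny illustrative list


def tokenize(text):
    """Lowercase, split on whitespace, drop stop words (no stemming here)."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def build_index(docs):
    """Return term -> {doc_id: term frequency} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search_and(index, total_docs, query, max_results=5):
    """Boolean AND retrieval ranked by a summed TF-IDF score."""
    terms = tokenize(query)
    if not terms or any(t not in index for t in terms):
        return []
    # AND semantics: keep only documents that contain every query term.
    candidates = set.intersection(*(set(index[t]) for t in terms))
    scores = {}
    for doc_id in candidates:
        scores[doc_id] = sum(
            (1 + math.log(index[t][doc_id])) * math.log(total_docs / len(index[t]))
            for t in terms
        )
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:max_results]]


docs = {
    "d1": "information retrieval in the information age",
    "d2": "retrieval of web pages",
    "d3": "machine learning",
}
index = build_index(docs)
results = search_and(index, len(docs), "information retrieval")
```

Here only `d1` contains both query terms, so it is the single result; a query for `retrieval` alone would match both `d1` and `d2`.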

Quality Controls

Duplicate detection: SimHash-based similarity detection
Content filtering: Size and quality thresholds
Domain validation: UCI-specific domain filtering
Robots.txt compliance: Respects crawler restrictions
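
For readers unfamiliar with SimHash, a minimal sketch follows. It assumes unweighted tokens, a 64-bit fingerprint derived from SHA-256, and a 0.9 similarity threshold. These are illustrative choices, not values taken from `similarity.py`.

```python
import hashlib

BITS = 64  # fingerprint width (assumed)


def simhash(tokens):
    """Compute a 64-bit SimHash fingerprint from an iterable of tokens."""
    weights = [0] * BITS
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a, b, threshold=0.9):
    """Treat two fingerprints as duplicates if enough bits agree."""
    similarity = 1 - hamming_distance(a, b) / BITS
    return similarity >= threshold


page = "uci information retrieval course project search engine".split()
fp_original = simhash(page)
fp_reordered = simhash(reversed(page))
```

Because each bit is decided by a sum of votes, token order does not affect the fingerprint, and pages that share most of their tokens tend to differ in only a few bits.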

Performance Features

Batch processing: Efficient memory usage during indexing
Lazy merging: Memory-efficient index combination
Concurrent crawling: Multi-threaded web crawler
Caching: Response caching for better performance

Configuration

Search Engine Configuration

Key settings in python/src/constants.py:

BATCH_SIZE: Documents per processing batch (default: 5000)
MAX_SEARCH_RESULTS: Maximum results returned (default: 5)
TOTAL_DOCUMENT_COUNT: Total number of indexed documents, used in the IDF calculation

Crawler Configuration

Configure crawling in UCI-Web-Crawler-master/enhanced_config.ini:

[CRAWLER]
MAX_PAGES = 10000
POLITENESS = 0.5
THREADCOUNT = 4

[PYTHON_INTEGRATION]
ENABLED = true
BATCH_SIZE = 1000
REAL_TIME_INDEXING = false
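
A file in this format can be read with Python's standard `configparser` module. The sketch below parses the sample above from a string; the `load_settings` helper and the returned key names are hypothetical, and how the crawler itself loads `enhanced_config.ini` is not shown in this README.

```python
import configparser

SAMPLE_CONFIG = """
[CRAWLER]
MAX_PAGES = 10000
POLITENESS = 0.5
THREADCOUNT = 4

[PYTHON_INTEGRATION]
ENABLED = true
BATCH_SIZE = 1000
REAL_TIME_INDEXING = false
"""


def load_settings(text):
    """Parse the INI text and convert each value to a typed setting."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    crawler = parser["CRAWLER"]
    integration = parser["PYTHON_INTEGRATION"]
    return {
        "max_pages": crawler.getint("MAX_PAGES"),
        "politeness": crawler.getfloat("POLITENESS"),
        "thread_count": crawler.getint("THREADCOUNT"),
        "integration_enabled": integration.getboolean("ENABLED"),
        "batch_size": integration.getint("BATCH_SIZE"),
        "real_time_indexing": integration.getboolean("REAL_TIME_INDEXING"),
    }


settings = load_settings(SAMPLE_CONFIG)
```

To read the real file instead, `parser.read(path)` can replace `parser.read_string(text)`.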

Performance Metrics

Index Statistics

Document processing: ~1000 documents/minute
Index size: Typically 10-20% of original content size
Query response time: <100ms for most queries
Memory usage: ~2GB during indexing, ~500MB during search

Search Quality

Precision: High relevance for UCI domain content
Recall: Comprehensive coverage of indexed content
Duplicate detection: >95% accuracy with SimHash

Testing & Development

Run Tests

cd python
python -m pytest tests/

Debug Mode

# Enable verbose logging
python src/run_queries.py --log_level DEBUG

# Test crawler integration
cd UCI-Web-Crawler-master
python integration_demo.py

Development Tools

Logging: Comprehensive logging with configurable levels
Statistics: Performance and quality metrics tracking
Error handling: Graceful error recovery and reporting

API Reference

Search Endpoint

POST /search
Content-Type: application/json

{
  "query": "information retrieval",
  "search_type": "name"
}

Response:

{
  "results": [
    {
      "url": "https://www.ics.uci.edu/~someurl",
      "content": "Page content..."
    }
  ],
  "time": 45
}
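
A minimal client sketch using only the standard library. The README does not state the API server's host or port, so `http://localhost:8000` is an assumption that must be adjusted to match your setup; the helper names are hypothetical.

```python
import json
import urllib.request

# Assumed address: the README does not say where the API server listens.
API_URL = "http://localhost:8000/search"


def build_search_request(query, search_type="name", url=API_URL):
    """Build (but do not send) a POST request for the /search endpoint."""
    body = json.dumps({"query": query, "search_type": search_type}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, search_type="name", url=API_URL, timeout=10):
    """Send the request and return the decoded JSON response."""
    request = build_search_request(query, search_type, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_search_request("information retrieval")
```

With the API server running, `search("information retrieval")` should return a dictionary shaped like the response above, with a `results` list and a `time` field.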

Known Limitations

Query Types: Currently supports AND queries only
Language: Optimized for English content
Domain Scope: Limited to UCI domains (ics, cs, informatics, stat)
Rust Implementation: Experimental and incomplete

Future Enhancements

Support for OR and phrase queries
Auto-complete suggestions
Advanced ranking algorithms (PageRank, BM25)
Real-time indexing for live content updates
Support for non-English content
Result clustering and faceted search

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Development Guidelines

Follow existing code style and patterns
Add comprehensive error handling and logging
Update documentation for new features
Test changes thoroughly before submitting

Recent Improvements

Quick Wins Implemented

Fixed indexer return values for proper token counting
Added comprehensive logging throughout the system
Enhanced error handling for file operations
Created constants file for better configuration management
Fixed typos and improved code quality

Enhanced Crawler

Python integration for automatic indexing
Better error handling and content quality checks
Configurable processing with multiple output formats
Real-time and batch processing modes
Comprehensive documentation and testing tools

License

This project is part of an academic assignment for Information Retrieval coursework.

Support

For issues and questions:

Check the logs for error details
Review configuration settings
Test individual components separately
Consult the enhanced crawler documentation in UCI-Web-Crawler-master/ENHANCED_README.md
