joshtalla/UCI-Search-Engine

UCI Search Engine

A comprehensive search engine implementation with web crawling, indexing, and retrieval capabilities for UCI domain content.

Overview

This project implements a full-stack search engine solution that includes:

  • Web Crawler: Enhanced crawler for UCI domains with Python integration
  • Document Indexing: Efficient inverted index construction with TF-IDF scoring
  • Search Interface: FastAPI-based REST API with React frontend
  • Multi-language Support: Python (primary) and Rust (experimental) implementations

Project Structure

UCI-Search-Engine/
├── python/                    # Python implementation (primary)
│   ├── src/                  # Source code
│   │   ├── main.py          # Index builder
│   │   ├── indexer.py       # Inverted index implementation
│   │   ├── tokenizer.py     # Text tokenization and stemming
│   │   ├── query.py         # Search and retrieval
│   │   ├── similarity.py    # Duplicate detection
│   │   └── run_queries.py   # API server
│   └── requirements.txt     # Python dependencies
├── client/                   # React frontend
│   ├── src/                 # Frontend source
│   └── package.json         # Node dependencies
├── rust/                    # Rust implementation (experimental)
│   ├── src/                 # Rust source code
│   └── Cargo.toml          # Rust dependencies
└── UCI-Web-Crawler-master/  # Enhanced web crawler
    ├── enhanced_launch.py   # Enhanced crawler launcher
    ├── python_connector.py  # Python integration
    └── ENHANCED_README.md   # Crawler documentation

Installation & Setup

Prerequisites

  • Python 3.8+
  • Node.js 16+ (for frontend)
  • Rust 1.70+ (optional, for Rust implementation)

Python Backend Setup

  1. Install Python dependencies:

    cd python
    pip install -r requirements.txt

  2. Download NLTK data (first time only) by running the following in a Python interpreter:

    import nltk
    nltk.download('punkt')
    nltk.download('stopwords')

Frontend Setup

  1. Install Node.js dependencies:
    cd client
    npm install

Rust Setup (Optional)

  1. Build Rust implementation:
    cd rust
    cargo build --release

Usage Guide

Building the Search Index

Step 1: Create Partial Indexes

cd python
python src/main.py

This processes documents in batches and creates partial indexes in the indexes/ directory.
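
As a rough illustration of this batching step, the sketch below groups documents into fixed-size batches and writes one term-sorted partial index per batch. The function names, the JSON-lines layout, and the `partial_N.jsonl` file naming are assumptions made for this example; they are not taken from main.py.

```python
import json
import os
from collections import defaultdict
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def write_partial_indexes(docs, out_dir, batch_size):
    """Write one partial index file per batch of (doc_id, tokens) pairs.

    Hypothetical layout: one JSON line per term, sorted by term,
    holding [term, {doc_id: term_frequency}].
    """
    paths = []
    for n, batch in enumerate(batched(docs, batch_size)):
        index = defaultdict(dict)
        for doc_id, tokens in batch:
            key = str(doc_id)  # JSON object keys are always strings
            for tok in tokens:
                index[tok][key] = index[tok].get(key, 0) + 1
        path = os.path.join(out_dir, f"partial_{n}.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for term in sorted(index):
                f.write(json.dumps([term, index[term]]) + "\n")
        paths.append(path)
    return paths
```

Only one batch is held in memory at a time, and writing each file in term order is what would allow a later merge to stream the files instead of loading them whole.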

Step 2: Merge Partial Indexes

python src/lazy_merger.py

This combines all partial indexes into merged indexes organized by letter.
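
One way such a merge can stay memory-efficient is to stream the partial files in term order and combine postings as they go by. The sketch below assumes each partial index is a term-sorted JSON-lines file of `[term, {doc_id: term_frequency}]` entries; that layout and all function names are hypothetical and may differ from what lazy_merger.py does.

```python
import heapq
import json
from itertools import groupby


def read_postings(path):
    """Lazily yield (term, postings) pairs from one sorted partial index file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, postings = json.loads(line)
            yield term, postings


def merge_partials(paths):
    """Stream-merge sorted partial indexes, combining postings per term."""
    streams = [read_postings(p) for p in paths]
    # heapq.merge keeps only one pending entry per file in memory.
    merged = heapq.merge(*streams, key=lambda pair: pair[0])
    for term, group in groupby(merged, key=lambda pair: pair[0]):
        combined = {}
        for _, postings in group:
            # Assumes each document was indexed in exactly one batch.
            combined.update(postings)
        yield term, combined


def bucket_for(term):
    """Pick the per-letter output bucket; non-letter terms share one bucket."""
    first = term[0].lower()
    return first if "a" <= first <= "z" else "other"
```

A writer would then append each `(term, postings)` pair to the file chosen by `bucket_for(term)`, so a query only needs to open the buckets for its own terms.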

Running the Search Engine

Option 1: API Server with Web Interface

cd python/src
python run_queries.py

Then open your browser to http://localhost:3000 for the web interface.

Option 2: Frontend Development Server

# Terminal 1: Start API server
cd python/src
python run_queries.py

# Terminal 2: Start frontend dev server
cd client
npm run dev

Frontend will be available at http://localhost:5173.

Web Crawling (Optional)

To crawl new content and integrate with the search engine:

cd UCI-Web-Crawler-master

# Basic crawling
python enhanced_launch.py

# With Python search integration
python enhanced_launch.py --python_integration

# Test integration
python integration_demo.py

Features

Search Capabilities

  • Boolean AND queries: Every query term must appear in a returned document
  • TF-IDF scoring: Documents ranked by relevance
  • Stemming: Handles word variations (running → run)
  • Stop word filtering: Removes common words for better relevance
  • Fast retrieval: Optimized inverted index structure
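
To make the AND semantics and the ranking concrete, here is a minimal sketch of an in-memory inverted index with intersection-based retrieval and TF-IDF ordering. The function names and the specific weighting, `(1 + ln tf) * ln(N / df)`, are assumptions for illustration; the project's query.py may score differently. The default `limit` mirrors the documented MAX_SEARCH_RESULTS default of 5.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: term_frequency}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index


def search_and(index, total_docs, query_terms, limit=5):
    """Return doc ids containing every query term, best TF-IDF score first."""
    postings = [index.get(t, {}) for t in query_terms]
    if not postings or any(not p for p in postings):
        return []  # empty query, or some term matches no document
    # Intersect starting from the rarest term to keep the candidate set small.
    postings.sort(key=len)
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates.intersection_update(p)
    scores = {
        doc_id: sum(
            (1 + math.log(p[doc_id])) * math.log(total_docs / len(p))
            for p in postings
        )
        for doc_id in candidates
    }
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))[:limit]
```

For example, with `index = build_index({1: ["uci", "search"], 2: ["uci"]})`, the call `search_and(index, 2, ["uci", "search"])` considers only document 1, because document 2 lacks the term "search".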

Quality Controls

  • Duplicate detection: SimHash-based similarity detection
  • Content filtering: Size and quality thresholds
  • Domain validation: UCI-specific domain filtering
  • Robots.txt compliance: Respects crawler restrictions
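
For readers unfamiliar with SimHash, the following sketch shows the general technique: each token votes on every bit of a fingerprint, and two documents are treated as near-duplicates when their fingerprints differ in few bits. The 64-bit width, the SHA-256 token hash, and the 0.9 similarity threshold are assumptions chosen for this example, not values read from similarity.py.

```python
import hashlib


def simhash(tokens, bits=64):
    """Compute a SimHash fingerprint from an iterable of tokens.

    `bits` should be a multiple of 8 and at most 256 for this sketch.
    """
    weights = [0] * bits
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a, b, bits=64, threshold=0.9):
    """Treat fingerprints as duplicates when enough bits agree."""
    similarity = 1 - hamming_distance(a, b) / bits
    return similarity >= threshold
```

Because similar token sets flip only a few votes, their fingerprints stay close in Hamming distance, which makes the comparison much cheaper than comparing full page text.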

Performance Features

  • Batch processing: Efficient memory usage during indexing
  • Lazy merging: Memory-efficient index combination
  • Concurrent crawling: Multi-threaded web crawler
  • Caching: Response caching for better performance

Configuration

Search Engine Configuration

Key settings in python/src/constants.py:

  • BATCH_SIZE: Documents per processing batch (default: 5000)
  • MAX_SEARCH_RESULTS: Maximum results returned (default: 5)
  • TOTAL_DOCUMENT_COUNT: Total documents for IDF calculation

Crawler Configuration

Configure crawling in UCI-Web-Crawler-master/enhanced_config.ini:

[CRAWLER]
MAX_PAGES = 10000
POLITENESS = 0.5
THREADCOUNT = 4

[PYTHON_INTEGRATION]
ENABLED = true
BATCH_SIZE = 1000
REAL_TIME_INDEXING = false
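
A file in this format can be read with Python's standard `configparser` module. The sketch below parses the sample shown above into typed values; the `load_crawler_settings` helper and the keys of the dictionary it returns are hypothetical, and the crawler itself may load its settings differently.

```python
import configparser

SAMPLE_CONFIG = """
[CRAWLER]
MAX_PAGES = 10000
POLITENESS = 0.5
THREADCOUNT = 4

[PYTHON_INTEGRATION]
ENABLED = true
BATCH_SIZE = 1000
REAL_TIME_INDEXING = false
"""


def load_crawler_settings(text):
    """Parse INI text and convert each option to its natural Python type."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    crawler = parser["CRAWLER"]
    integration = parser["PYTHON_INTEGRATION"]
    return {
        "max_pages": crawler.getint("MAX_PAGES"),
        "politeness": crawler.getfloat("POLITENESS"),
        "thread_count": crawler.getint("THREADCOUNT"),
        "integration_enabled": integration.getboolean("ENABLED"),
        "batch_size": integration.getint("BATCH_SIZE"),
        "real_time_indexing": integration.getboolean("REAL_TIME_INDEXING"),
    }


settings = load_crawler_settings(SAMPLE_CONFIG)
```

To read the real file instead of the inline sample, `parser.read("enhanced_config.ini")` can replace `read_string`. Section names are case-sensitive in `configparser`, while option names are not.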

Performance Metrics

Index Statistics

  • Document processing: ~1000 documents/minute
  • Index size: Typically 10-20% of original content size
  • Query response time: <100ms for most queries
  • Memory usage: ~2GB during indexing, ~500MB during search

Search Quality

  • Precision: High relevance for UCI domain content
  • Recall: Comprehensive coverage of indexed content
  • Duplicate detection: >95% accuracy with SimHash

Testing & Development

Run Tests

cd python
python -m pytest tests/

Debug Mode

# Enable verbose logging (run from the python/ directory)
python src/run_queries.py --log_level DEBUG

# Test crawler integration (run from the repository root)
cd UCI-Web-Crawler-master
python integration_demo.py

Development Tools

  • Logging: Comprehensive logging with configurable levels
  • Statistics: Performance and quality metrics tracking
  • Error handling: Graceful error recovery and reporting

API Reference

Search Endpoint

POST /search
Content-Type: application/json

{
  "query": "information retrieval",
  "search_type": "name"
}

Response:

{
  "results": [
    {
      "url": "https://www.ics.uci.edu/~someurl",
      "content": "Page content..."
    }
  ],
  "time": 45
}
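
A small client for this endpoint could look like the sketch below, which uses only the standard library. The base URL is an assumption to be supplied by the caller, since the port the API server listens on is not stated here, and the helper names are hypothetical. The request and response shapes follow the example above.

```python
import json
import urllib.request


def build_search_request(base_url, query, search_type="name"):
    """Build the POST /search request without sending it."""
    payload = json.dumps({"query": query, "search_type": search_type})
    return urllib.request.Request(
        base_url.rstrip("/") + "/search",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(base_url, query, search_type="name", timeout=10):
    """Send the query and return (list of result URLs, reported time)."""
    request = build_search_request(base_url, query, search_type)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return [hit["url"] for hit in body["results"]], body["time"]
```

With the API server running, `search(base_url, "information retrieval")` would return the matching URLs together with the `time` value from the response.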

Known Limitations

  • Query Types: Currently supports AND queries only
  • Language: Optimized for English content
  • Domain Scope: Limited to UCI domains (ics, cs, informatics, stat)
  • Rust Implementation: Experimental and incomplete
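
The domain restriction can be pictured as a host-suffix check. The sketch below assumes the four names listed above refer to the `ics`, `cs`, `informatics`, and `stat` subdomains of `uci.edu`; the `ALLOWED_SUFFIXES` tuple and the `in_scope` helper are hypothetical and not taken from the crawler.

```python
from urllib.parse import urlparse

# Assumed expansion of "ics, cs, informatics, stat" into full host suffixes.
ALLOWED_SUFFIXES = (
    "ics.uci.edu",
    "cs.uci.edu",
    "informatics.uci.edu",
    "stat.uci.edu",
)


def in_scope(url):
    """Return True when the URL's host is an allowed domain or a subdomain of one."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    # Require an exact match or a dot boundary, so that hosts such as
    # physics.uci.edu are not mistaken for ics.uci.edu.
    return any(host == s or host.endswith("." + s) for s in ALLOWED_SUFFIXES)
```

Checking for the dot boundary matters because a plain `endswith("ics.uci.edu")` test would also accept unrelated hosts whose names happen to end in the same letters.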

Future Enhancements

  • Support for OR and phrase queries
  • Auto-complete suggestions
  • Advanced ranking algorithms (PageRank, BM25)
  • Real-time indexing for live content updates
  • Support for non-English content
  • Result clustering and faceted search

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Follow existing code style and patterns
  • Add comprehensive error handling and logging
  • Update documentation for new features
  • Test changes thoroughly before submitting

Recent Improvements

Quick Wins Implemented

  • Fixed indexer return values for proper token counting
  • Added comprehensive logging throughout the system
  • Enhanced error handling for file operations
  • Created constants file for better configuration management
  • Fixed typos and improved code quality

Enhanced Crawler

  • Python integration for automatic indexing
  • Better error handling and content quality checks
  • Configurable processing with multiple output formats
  • Real-time and batch processing modes
  • Comprehensive documentation and testing tools

License

This project is part of an academic assignment for Information Retrieval coursework.

Support

For issues and questions:

  1. Check the logs for error details
  2. Review configuration settings
  3. Test individual components separately
  4. Consult the enhanced crawler documentation in UCI-Web-Crawler-master/ENHANCED_README.md

About

Search Engine For the University of California-Irvine webpages
