A comprehensive search engine implementation with web crawling, indexing, and retrieval capabilities for UCI domain content.
This project implements a full-stack search engine solution that includes:
- Web Crawler: Enhanced crawler for UCI domains with Python integration
- Document Indexing: Efficient inverted index construction with TF-IDF scoring
- Search Interface: FastAPI-based REST API with React frontend
- Multi-language Support: Python (primary) and Rust (experimental) implementations
UCI-Search-Engine/
├── python/ # Python implementation (primary)
│ ├── src/ # Source code
│ │ ├── main.py # Index builder
│ │ ├── indexer.py # Inverted index implementation
│ │ ├── tokenizer.py # Text tokenization and stemming
│ │ ├── query.py # Search and retrieval
│ │ ├── similarity.py # Duplicate detection
│ │ └── run_queries.py # API server
│ └── requirements.txt # Python dependencies
├── client/ # React frontend
│ ├── src/ # Frontend source
│ └── package.json # Node dependencies
├── rust/ # Rust implementation (experimental)
│ ├── src/ # Rust source code
│ └── Cargo.toml # Rust dependencies
└── UCI-Web-Crawler-master/ # Enhanced web crawler
├── enhanced_launch.py # Enhanced crawler launcher
├── python_connector.py # Python integration
└── ENHANCED_README.md # Crawler documentation
- Python 3.8+
- Node.js 16+ (for frontend)
- Rust 1.70+ (optional, for Rust implementation)
-
Install Python dependencies:
cd python pip install -r requirements.txt -
Download NLTK data (first time only):
import nltk nltk.download('punkt') nltk.download('stopwords')
- Install Node.js dependencies:
cd client npm install
- Build Rust implementation:
cd rust cargo build --release
cd python
python src/main.pyThis processes documents in batches and creates partial indexes in the indexes/ directory.
python src/lazy_merger.pyThis combines all partial indexes into merged indexes organized by letter.
cd python/src
python run_queries.pyThen open your browser to http://localhost:3000 for the web interface.
# Terminal 1: Start API server
cd python/src
python run_queries.py
# Terminal 2: Start frontend dev server
cd client
npm run devFrontend will be available at http://localhost:5173.
To crawl new content and integrate with the search engine:
cd UCI-Web-Crawler-master
# Basic crawling
python enhanced_launch.py
# With Python search integration
python enhanced_launch.py --python_integration
# Test integration
python integration_demo.py- Boolean AND queries: All terms must be present in results
- TF-IDF scoring: Documents ranked by relevance
- Stemming: Handles word variations (running → run)
- Stop word filtering: Removes common words for better relevance
- Fast retrieval: Optimized inverted index structure
- Duplicate detection: SimHash-based similarity detection
- Content filtering: Size and quality thresholds
- Domain validation: UCI-specific domain filtering
- Robots.txt compliance: Respects crawler restrictions
- Batch processing: Efficient memory usage during indexing
- Lazy merging: Memory-efficient index combination
- Concurrent crawling: Multi-threaded web crawler
- Caching: Response caching for better performance
Key settings in python/src/constants.py:
BATCH_SIZE: Documents per processing batch (default: 5000)MAX_SEARCH_RESULTS: Maximum results returned (default: 5)TOTAL_DOCUMENT_COUNT: Total documents for IDF calculation
Configure crawling in UCI-Web-Crawler-master/enhanced_config.ini:
[CRAWLER]
MAX_PAGES = 10000
POLITENESS = 0.5
THREADCOUNT = 4
[PYTHON_INTEGRATION]
ENABLED = true
BATCH_SIZE = 1000
REAL_TIME_INDEXING = false- Document processing: ~1000 documents/minute
- Index size: Typically 10-20% of original content size
- Query response time: <100ms for most queries
- Memory usage: ~2GB during indexing, ~500MB during search
- Precision: High relevance for UCI domain content
- Recall: Comprehensive coverage of indexed content
- Duplicate detection: >95% accuracy with SimHash
cd python
python -m pytest tests/# Enable verbose logging
python src/run_queries.py --log_level DEBUG
# Test crawler integration
cd UCI-Web-Crawler-master
python integration_demo.py- Logging: Comprehensive logging with configurable levels
- Statistics: Performance and quality metrics tracking
- Error handling: Graceful error recovery and reporting
POST /search
Content-Type: application/json
{
"query": "information retrieval",
"search_type": "name"
}Response:
{
"results": [
{
"url": "https://www.ics.uci.edu/~someurl",
"content": "Page content..."
}
],
"time": 45
}- Query Types: Currently supports AND queries only
- Language: Optimized for English content
- Domain Scope: Limited to UCI domains (ics, cs, informatics, stat)
- Rust Implementation: Experimental and incomplete
- OR and phrase queries support
- Auto-complete suggestions
- Advanced ranking algorithms (PageRank, BM25)
- Real-time indexing for live content updates
- Multi-language support
- Result clustering and faceted search
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
- Follow existing code style and patterns
- Add comprehensive error handling and logging
- Update documentation for new features
- Test changes thoroughly before submitting
- Fixed indexer return values for proper token counting
- Added comprehensive logging throughout the system
- Enhanced error handling for file operations
- Created constants file for better configuration management
- Fixed typos and improved code quality
- Python integration for automatic indexing
- Better error handling and content quality checks
- Configurable processing with multiple output formats
- Real-time and batch processing modes
- Comprehensive documentation and testing tools
This project is part of an academic assignment for Information Retrieval coursework.
For issues and questions:
- Check the logs for error details
- Review configuration settings
- Test individual components separately
- Consult the enhanced crawler documentation in
UCI-Web-Crawler-master/ENHANCED_README.md