Search Engine Overview

This search engine indexes and searches web pages in UCI's ICS domain. It combines standard information retrieval techniques (TF-IDF scoring, cosine similarity, link analysis) with a cached inverted index, and typically returns results in 10-300 ms depending on query length.

Performance Metrics

Query Type	Response Time
Single term	10-100ms
Multi-term	100-200ms
Complex (5+ terms)	200-300ms

Core Architecture

The system is built on four main components:

Document Processing Pipeline

Component	Function
HTML Parser	Extracts clean text from web pages using BeautifulSoup4
Text Analyzer	Identifies important content from headers and titles
Duplicate Detector	Prevents index bloat using SimHash algorithm
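
As an illustration of the duplicate detection step, here is a minimal SimHash sketch. It is not the project's implementation: the 64-bit width, the MD5-based token hash, the regex tokenizer, and the distance threshold of 3 are all assumptions made for the example.

```python
import hashlib
import re

BITS = 64  # assumed fingerprint width


def simhash(text: str) -> int:
    """Return a 64-bit SimHash fingerprint of `text` as an integer."""
    weights = [0] * BITS
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # Stable per-token hash (Python's built-in hash() is salted per process).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, threshold: int = 3) -> bool:
    """Treat two pages as near-duplicates if few fingerprint bits differ."""
    return hamming_distance(a, b) <= threshold
```

A page would be skipped at index time when its fingerprint is within the threshold of one already seen. The fingerprint can be stored as a string (for example with `format(fp, "016x")`) if the Document record expects one.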

Search Algorithm

Feature	Description
TF-IDF Scoring	Measures term importance in documents
Cosine Similarity	Computes relevance between query and documents
PageRank & HITS	Incorporates web graph authority signals
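
To make the scoring features above concrete, the following is a small pure-Python sketch of TF-IDF weighting and cosine similarity. The log-scaled weighting variant and the function names are assumptions for illustration; the project may weight terms differently.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Log-scaled term frequency times inverse document frequency.
            term: (1 + math.log10(tf)) * math.log10(n_docs / doc_freq[term])
            for term, tf in counts.items()
        })
    return vectors


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A query can be scored by building its vector the same way and comparing it against each candidate document's vector.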

Index Management

Strategy	Implementation
Storage	Hybrid Pickle/JSON for optimal speed/space tradeoff
Access	Peek-based retrieval to minimize memory usage
Caching	LRU cache for frequent terms and queries
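
The access and caching strategies can be sketched as follows, assuming "peek-based retrieval" means reading only one term's postings from disk through a stored byte offset. The JSON-lines layout, the function names, and the cache size are hypothetical and chosen only for this example.

```python
import json
import os
import tempfile
from functools import lru_cache
from typing import Dict, List


def write_index(index: Dict[str, List[dict]], path: str) -> Dict[str, int]:
    """Write one JSON line per term and return a term -> byte offset map."""
    offsets = {}
    with open(path, "wb") as f:
        for term in sorted(index):
            offsets[term] = f.tell()
            line = json.dumps({"term": term, "postings": index[term]}) + "\n"
            f.write(line.encode("utf-8"))
    return offsets


def make_reader(path: str, offsets: Dict[str, int], cache_size: int = 1024):
    """Return a cached lookup function that reads a single term's line."""

    @lru_cache(maxsize=cache_size)
    def postings_for(term: str) -> tuple:
        if term not in offsets:
            return ()
        with open(path, "rb") as f:
            f.seek(offsets[term])  # jump straight to this term's line
            record = json.loads(f.readline().decode("utf-8"))
        # Tuples are immutable, so callers cannot corrupt cached results.
        return tuple((p["doc_id"], p["frequency"]) for p in record["postings"])

    return postings_for


# Usage: write a tiny index to a temporary file, then look terms up.
demo_path = os.path.join(tempfile.mkdtemp(), "index.jsonl")
demo_offsets = write_index(
    {"search": [{"doc_id": 1, "frequency": 3}],
     "index": [{"doc_id": 2, "frequency": 1}]},
    demo_path,
)
lookup = make_reader(demo_path, demo_offsets)
print(lookup("search"))  # read from disk
print(lookup("search"))  # served from the LRU cache
```

Only the small offset map stays in memory; the postings themselves are loaded on demand, and repeated terms skip the disk read entirely.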

Query Processing

Stage	Operation
Tokenization	NLTK-based text normalization
Stemming	Porter stemming for word variations
Ranking	Multi-factor score combining relevance signals
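
The ranking stage could combine its signals along these lines. This is a sketch under stated assumptions: the signal names and the weights are hypothetical, and the real combination may not be a plain weighted sum.

```python
from typing import Dict, List

# Hypothetical weights, chosen only for illustration.
WEIGHTS = {
    "cosine": 0.6,
    "importance": 0.2,
    "pagerank": 0.1,
    "hits_authority": 0.1,
}


def combined_score(signals: Dict[str, float],
                   weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of relevance signals; missing signals count as 0."""
    return sum(weight * signals.get(name, 0.0) for name, weight in weights.items())


def rank(candidates: Dict[int, Dict[str, float]], top_k: int = 10) -> List[int]:
    """Return doc IDs ordered by descending score, ties broken by doc ID."""
    scored = [(combined_score(signals), doc_id)
              for doc_id, signals in candidates.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]
```

Each signal is assumed to be normalized to a comparable range (for example 0 to 1) before it is combined, otherwise one signal can dominate the others.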

Technical Implementation

The codebase is organized into focused modules:

search.py: Core search logic and ranking
indexer.py: Document processing and index building
token_processor.py: Text analysis and normalization
document_processor.py: HTML handling and deduplication

Data Structures

Document

from dataclasses import dataclass
from typing import List

@dataclass
class Document:
    url: str                    # Document URL
    content: str                # Text content extracted from the HTML
    doc_id: int                 # Unique document identifier
    simhash: str                # SimHash fingerprint for deduplication
    token_count: int            # Number of tokens in document
    outgoing_links: List[str]   # Outgoing URLs for link analysis
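
The outgoing_links field is what link analysis consumes. As a rough illustration, here is a power-iteration PageRank sketch over a URL to outgoing-URLs mapping. The damping factor, the iteration count, and the handling of pages without outgoing links are assumptions, not details taken from the project.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over a URL -> outgoing URLs mapping."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    if n == 0:
        return {}
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank
```

The input mapping could be built as `{doc.url: doc.outgoing_links for doc in documents}`, with the resulting scores fed into ranking as one signal.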

Posting

from dataclasses import dataclass
from typing import List

@dataclass
class Posting:
    doc_id: int            # Document identifier
    frequency: int         # Term frequency in document
    importance: float      # Combined weight from HTML tags
    tf_idf: float          # Term frequency-inverse document frequency score
    positions: List[int]   # Token positions for phrase queries
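
The positions field makes phrase queries possible. The sketch below shows one way to test whether a phrase occurs in a single document, given each term's positions in that document; the function name and input shape are assumptions for illustration.

```python
from typing import Dict, List


def phrase_match(positions_by_term: Dict[str, List[int]],
                 phrase: List[str]) -> bool:
    """Return True if the phrase terms occur at consecutive positions."""
    if not phrase or any(term not in positions_by_term for term in phrase):
        return False
    # Sets give constant-time membership checks for the following terms.
    later_terms = [set(positions_by_term[term]) for term in phrase[1:]]
    for start in positions_by_term[phrase[0]]:
        if all(start + offset in positions
               for offset, positions in enumerate(later_terms, start=1)):
            return True
    return False
```

For each position of the first term, the check looks for the second term one position later, the third term two positions later, and so on.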

Index Structure

{
    "term1": [Posting1, Posting2, ...],
    "term2": [Posting3, Posting4, ...],
    ...
}
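
A structure of this shape could be built as in the following sketch. The Posting class here is a trimmed-down stand-in that omits the importance and tf_idf fields, and build_index is a hypothetical helper rather than the function in indexer.py.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Posting:
    # Simplified stand-in: importance and tf_idf are left out.
    doc_id: int
    frequency: int = 0
    positions: List[int] = field(default_factory=list)


def build_index(docs: Dict[int, List[str]]) -> Dict[str, List[Posting]]:
    """Map each term to a list of postings, one per document containing it."""
    index: Dict[str, List[Posting]] = defaultdict(list)
    for doc_id in sorted(docs):
        per_term: Dict[str, Posting] = {}
        for position, term in enumerate(docs[doc_id]):
            posting = per_term.setdefault(term, Posting(doc_id))
            posting.frequency += 1
            posting.positions.append(position)
        for term, posting in per_term.items():
            index[term].append(posting)
    return dict(index)
```

Processing documents in doc_id order keeps every postings list sorted by doc_id, which simplifies intersecting lists for multi-term queries.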

Usage

Build the index:

python3 indexer.py

Start the search engine:

# For UI
streamlit run main.py

# For CLI
python3 search.py

Requirements

Python 3.7+
Streamlit
NLTK
BeautifulSoup4
NumPy
SciPy
scikit-learn
