Our search engine provides fast and accurate search capabilities across UCI's ICS domain. It combines modern information retrieval techniques with efficient data structures to deliver results in milliseconds.

| Query Type | Response Time |
|---|---|
| Single term | 10-100ms |
| Multi-term | 100-200ms |
| Complex (5+ terms) | 200-300ms |

The system is built on four main components working in harmony:
- Document Processing Pipeline

| Component | Function |
|---|---|
| HTML Parser | Extracts clean text from web pages using BeautifulSoup4 |
| Text Analyzer | Identifies important content from headers and titles |
| Duplicate Detector | Prevents index bloat using SimHash algorithm |

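As a rough illustration of how SimHash-based duplicate detection can work, here is a minimal sketch using only the standard library. The function names, the 64-bit fingerprint width, the MD5-based token hash, and the distance threshold of 3 are all assumptions for illustration, not details taken from this project.

```python
import hashlib
from collections import Counter
from typing import Iterable


def simhash(tokens: Iterable[str], bits: int = 64) -> int:
    """Return a SimHash fingerprint; similar token multisets give nearby fingerprints.

    `bits` must be a multiple of 8 and at most 128 (the size of an MD5 digest).
    """
    weights = [0] * bits
    for token, count in Counter(tokens).items():
        # MD5 is used only as a stable bit mixer here, not for security.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +count or -count on every bit position.
            weights[i] += count if (h >> i) & 1 else -count
    # A bit is set in the fingerprint when its votes are positive overall.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, max_distance: int = 3) -> bool:
    """Treat two pages as duplicates when their fingerprints are close."""
    return hamming_distance(a, b) <= max_distance
```

A crawler could compute the fingerprint once per page and skip indexing any page whose fingerprint is within the threshold of one already seen. The fingerprint is an integer here; it could be stored as a hex string if the index expects text.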
- Search Algorithm

| Feature | Description |
|---|---|
| TF-IDF Scoring | Measures term importance in documents |
| Cosine Similarity | Computes relevance between query and documents |
| PageRank & HITS | Incorporates web graph authority signals |

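To make the TF-IDF and cosine similarity rows concrete, the sketch below scores a tiny in-memory corpus. The log-scaled weighting `(1 + log tf) * log(N / df)` is one common variant chosen here as an assumption; the project's exact formula is not stated, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (1 + math.log(tf)) * math.log(n_docs / doc_freq[term])
            for term, tf in counts.items()
        })
    return vectors


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    ["search", "engine", "index"],
    ["search", "index", "index"],
    ["campus", "map"],
]
vectors = tf_idf_vectors(docs)
```

In a full engine the query would be vectorized the same way and compared against candidate documents; documents sharing no terms with the query score exactly zero.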
- Index Management

| Strategy | Implementation |
|---|---|
| Storage | Hybrid Pickle/JSON for optimal speed/space tradeoff |
| Access | Peek-based retrieval to minimize memory usage |
| Caching | LRU cache for frequent terms and queries |

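One way to read a single posting list from disk without loading the whole index, combined with an LRU cache for repeated terms, is sketched below. The JSON-lines file layout, the byte-offset table, and the names `write_index` and `make_reader` are assumptions made for this example; the project's actual Pickle/JSON layout may differ.

```python
import json
import os
import tempfile
from functools import lru_cache
from typing import Callable, Dict, List, Tuple


def write_index(path: str, index: Dict[str, List[int]]) -> Dict[str, int]:
    """Write one JSON line per term and return each term's byte offset."""
    offsets: Dict[str, int] = {}
    with open(path, "wb") as f:
        for term in sorted(index):
            offsets[term] = f.tell()
            f.write((json.dumps(index[term]) + "\n").encode("utf-8"))
    return offsets


def make_reader(path: str, offsets: Dict[str, int],
                cache_size: int = 1024) -> Callable[[str], Tuple[int, ...]]:
    """Return a lookup function that reads one posting list per call."""

    @lru_cache(maxsize=cache_size)
    def postings(term: str) -> Tuple[int, ...]:
        if term not in offsets:
            return ()
        with open(path, "rb") as f:
            f.seek(offsets[term])  # jump straight to this term's line
            return tuple(json.loads(f.readline()))

    return postings


# Hypothetical two-term index written to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "index.jsonl")
offsets = write_index(path, {"search": [1, 4], "engine": [4]})
postings = make_reader(path, offsets)
first = postings("search")   # read from disk
second = postings("search")  # served from the LRU cache
```

Only the small offset table stays in memory; posting lists are read on demand, and `postings.cache_info()` reports how often the cache was hit.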
- Query Processing

| Stage | Operation |
|---|---|
| Tokenization | NLTK-based text normalization |
| Stemming | Porter stemming for word variations |
| Ranking | Multi-factor score combining relevance signals |

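The ranking stage combines several signals into one score. The sketch below shows one plausible shape for that combination, a weighted sum. The `Signals` class, the three signal names, the assumption that each is normalized to the range 0 to 1, and the weights are all hypothetical; how the signals are actually balanced is not documented here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Signals:
    cosine: float      # query-document cosine similarity, assumed in 0..1
    importance: float  # weight from HTML tags (titles, headers), assumed in 0..1
    authority: float   # link-analysis score such as PageRank, assumed in 0..1


# Hypothetical weights chosen for illustration only.
WEIGHTS = {"cosine": 0.6, "importance": 0.25, "authority": 0.15}


def combined_score(s: Signals) -> float:
    """Weighted sum of the individual relevance signals."""
    return (WEIGHTS["cosine"] * s.cosine
            + WEIGHTS["importance"] * s.importance
            + WEIGHTS["authority"] * s.authority)


def rank(candidates: Dict[int, Signals], top_k: int = 10) -> List[Tuple[int, float]]:
    """Return the top_k (doc_id, score) pairs, best first."""
    scored = [(doc_id, combined_score(s)) for doc_id, s in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


candidates = {
    1: Signals(cosine=0.9, importance=0.1, authority=0.1),
    2: Signals(cosine=0.5, importance=1.0, authority=1.0),
    3: Signals(cosine=0.1, importance=0.0, authority=0.0),
}
ranked = rank(candidates)
```

With these weights, document 2 outranks document 1 despite a lower cosine score, because its tag importance and authority are much higher.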
The codebase is organized into focused modules:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    url: str                   # Document URL
    content: str               # Processed raw text content
    doc_id: int                # Unique document identifier
    simhash: str               # SimHash fingerprint for deduplication
    token_count: int           # Number of tokens in document
    outgoing_links: List[str]  # Outgoing URLs for link analysis


@dataclass
class Posting:
    doc_id: int           # Document identifier
    frequency: int        # Term frequency in document
    importance: float     # Combined weight from HTML tags
    tf_idf: float         # Term frequency-inverse document frequency score
    positions: List[int]  # Token positions for phrase queries
```

The inverted index maps each term to its list of postings:

```
{
    "term1": [Posting1, Posting2, ...],
    "term2": [Posting3, Posting4, ...],
    ...
}
```

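A term-to-postings mapping of this shape can be built in a single pass over the documents. The sketch below uses a trimmed-down `Posting` (only `doc_id`, `frequency`, and `positions`) and a regex tokenizer as a simplified stand-in for the NLTK tokenization and Porter stemming steps; `tokenize` and `build_index` are hypothetical names, not the project's functions.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Posting:
    doc_id: int
    frequency: int = 0
    positions: List[int] = field(default_factory=list)


def tokenize(text: str) -> List[str]:
    # Lowercase alphanumeric runs; no stemming in this simplified version.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: Dict[int, str]) -> Dict[str, List[Posting]]:
    """Map each term to its postings, ordered by ascending doc_id."""
    index: Dict[str, List[Posting]] = defaultdict(list)
    for doc_id in sorted(docs):
        per_term: Dict[str, Posting] = {}
        for position, token in enumerate(tokenize(docs[doc_id])):
            posting = per_term.setdefault(token, Posting(doc_id))
            posting.frequency += 1
            posting.positions.append(position)
        for term, posting in per_term.items():
            index[term].append(posting)
    return dict(index)


index = build_index({1: "Search engine search", 2: "Engine room"})
```

Keeping each posting list sorted by `doc_id` makes multi-term queries cheap to intersect, and the stored positions are what make phrase queries possible.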
- Build the index:
- Start the search engine:

```bash
# For UI
streamlit run main.py

# For CLI
python3 search.py
```

- Python 3.7+
- Streamlit
- NLTK
- BeautifulSoup4
- NumPy
- SciPy
- scikit-learn