Add keywords CLI tool for text vectorization #122

@cardmagic

Description

Summary

Provide a separate command-line tool for keyword extraction and term analysis using TF-IDF.

Motivation

TF-IDF is a vectorizer, not a classifier: it transforms text into weighted term vectors rather than assigning labels. Cramming it into the classifier CLI would misrepresent what the tool does.

A dedicated keywords tool enables:

  • Keyword extraction from documents
  • Understanding term importance
  • Building vocabularies for other tools
  • Document similarity analysis
  • Preprocessing pipelines

Proposed CLI

Fit (build vocabulary)

# Build vocabulary from files
keywords fit corpus/*.txt

# From stdin
cat documents.txt | keywords fit

# Custom model path
keywords fit -m vocab.json corpus/*.txt
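Internally, `fit` only needs to persist per-term document frequencies and a document count. A minimal sketch of what the model file might contain, assuming a simple JSON layout (the method name, JSON keys, and tokenizer here are illustrative, not the gem's actual API):

```ruby
require "json"

# Hypothetical sketch of what `keywords fit` could persist: per-term
# document frequencies plus a total document count, serialized to JSON.
def fit(docs, model_path: "keywords.json", min_df: 1)
  df = Hash.new(0)
  docs.each do |doc|
    # Count each term at most once per document (document frequency).
    doc.downcase.scan(/[a-z']+/).uniq.each { |term| df[term] += 1 }
  end
  df.reject! { |_, count| count < min_df }
  model = { "documents" => docs.size, "df" => df }
  File.write(model_path, JSON.pretty_generate(model))
  model
end
```

Storing raw document frequencies (rather than precomputed IDF values) keeps the model incrementally updatable: refitting with more documents only bumps counts.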

Transform (extract terms)

# Get weighted terms from text
keywords "Ruby is a programming language"
# => ruby:0.52 programming:0.41 language:0.38

# From stdin
echo "some document text" | keywords
# => document:0.45 text:0.42

# Top N terms only
keywords -n 5 "long document with many terms..."
# => term1:0.5 term2:0.4 term3:0.3 term4:0.2 term5:0.1
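The weighting behind these numbers can be sketched in a few lines. This is an illustrative TF-IDF computation against a fitted model, not the gem's `Classifier::TFIDF` implementation (which does not exist yet); the method name, tokenizer, and smoothing choice are assumptions:

```ruby
# Weight a document's terms by tf-idf against a fitted model:
# df is a term => document-frequency hash, total_docs the corpus size.
def transform(text, df, total_docs, top: nil)
  terms = text.downcase.scan(/[a-z']+/)
  tf = terms.tally
  weights = tf.filter_map do |term, count|
    next unless df.key?(term)                               # skip out-of-vocabulary terms
    idf = Math.log((1.0 + total_docs) / (1.0 + df[term])) + 1.0  # smoothed idf
    [term, (count.to_f / terms.size) * idf]
  end
  weights.sort_by! { |_, w| -w }                            # highest weight first
  top ? weights.first(top) : weights
end
```

The `top:` keyword mirrors the proposed `-n` flag: rarer terms get a larger IDF boost, so they float to the front of the list.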

Extract (convenience alias)

# Extract keywords from a file
keywords extract article.txt
# => machine:0.61 learning:0.58 neural:0.45 network:0.42

# From URL (with curl)
curl -s https://example.com/article | keywords extract

Info

keywords info
# => Documents: 1,234
# => Vocabulary: 5,678
# => Min DF: 1
# => Max DF: 1.0

Options

-m, --model FILE    Model file (default: ./keywords.json)
-n, --top N         Show top N terms only
-q                  Quiet mode (for scripting)
-v, --version       Show version
-h, --help          Show help

Fit-specific options

--min-df N          Minimum document frequency (default: 1)
--max-df N          Maximum document frequency ratio (default: 1.0)
--ngram MIN,MAX     N-gram range (default: 1,1)
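Since the implementation notes call for stdlib `optparse`, the flag table above maps onto it directly. A sketch of the parser, using the flags from this proposal (the `parse_options` helper and defaults are assumptions):

```ruby
require "optparse"

# Parse the proposed CLI flags into an options hash using stdlib optparse.
def parse_options(argv)
  opts = { model: "./keywords.json", min_df: 1, max_df: 1.0, ngram: [1, 1] }
  OptionParser.new do |o|
    o.on("-m", "--model FILE")     { |v| opts[:model] = v }
    o.on("-n", "--top N", Integer) { |v| opts[:top] = v }
    o.on("-q")                     { opts[:quiet] = true }
    o.on("--min-df N", Integer)    { |v| opts[:min_df] = v }
    o.on("--max-df N", Float)      { |v| opts[:max_df] = v }
    o.on("--ngram MIN,MAX", Array) { |v| opts[:ngram] = v.map(&:to_i) }
  end.parse!(argv)
  opts
end
```

`parse!` consumes recognized flags and leaves positional arguments (files or raw text) in `argv`, which is exactly the split the subcommands need.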

Examples

# Build vocabulary and extract keywords
keywords fit articles/*.txt
keywords "What are the main topics?"
# => topics:0.6 main:0.4

# Fit, then extract the top 5 terms
keywords fit corpus.txt
keywords -n 5 extract article.txt

# Compare documents (output as TSV for scripting)
keywords -q doc1.txt > /tmp/v1.txt
keywords -q doc2.txt > /tmp/v2.txt
# Then use external tool for cosine similarity
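For the similarity step, the proposed `term:weight` output format is trivial to consume downstream. A sketch of the "external tool" side, assuming whitespace-separated `term:weight` pairs as shown in the examples above:

```ruby
# Parse a "term:weight term:weight ..." line into a term => weight hash.
def parse_vector(line)
  line.split.to_h { |pair| k, v = pair.split(":"); [k, v.to_f] }
end

# Cosine similarity between two sparse term-weight vectors.
def cosine(a, b)
  dot  = a.sum { |term, w| w * b.fetch(term, 0.0) }
  norm = ->(v) { Math.sqrt(v.values.sum { |w| w * w }) }
  dot / (norm.call(a) * norm.call(b))
end
```

Identical documents score 1.0; documents sharing no vocabulary score 0.0, which makes the output easy to threshold in scripts.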

Design Principles

  1. Separate tool: TF-IDF is not classification, so don't pretend it is
  2. Transform is default: No subcommand needed for primary action
  3. Stdin works: Pipe-friendly
  4. Scriptable output: -q for machine-readable format

Implementation Notes

  • Use optparse (stdlib)
  • Reuse existing Classifier::TFIDF class (when implemented)
  • Exit codes: 0 success, 1 error, 2 usage error
  • Default model: ./keywords.json
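The exit-code convention from the notes above can be wired up like this. The `run` structure is a sketch, not the actual entry point; only the codes (0 success, 1 error, 2 usage error) come from this proposal:

```ruby
require "optparse"

# Sketch of the proposed exit-code convention:
# 0 = success, 1 = runtime error, 2 = usage error.
def run(argv)
  OptionParser.new { |o| o.on("-m FILE") }.parse!(argv)
  return 2 if argv.empty?          # usage error: no input given
  # ... load model, extract keywords from argv ...
  0
rescue OptionParser::ParseError
  2                                # bad flag or missing flag argument
rescue StandardError
  1                                # runtime failure (e.g. unreadable model)
end
```

Distinguishing usage errors (2) from runtime errors (1) lets shell pipelines tell "you called me wrong" apart from "something broke", which matters for the scripting use cases above.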
