Summary
Provide a separate command-line tool for keyword extraction and term analysis using TF-IDF.
Motivation
TF-IDF is a vectorizer, not a classifier. It transforms text into weighted term vectors. Cramming it into the classifier CLI would be dishonest about what the tool does.
A dedicated keywords tool enables:
- Keyword extraction from documents
- Understanding term importance
- Building vocabularies for other tools
- Document similarity analysis
- Preprocessing pipelines
Proposed CLI
Fit (build vocabulary)
# Build vocabulary from files
keywords fit corpus/*.txt
# From stdin
cat documents.txt | keywords fit
# Custom model path
keywords fit -m vocab.json corpus/*.txt
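A rough sketch of what fit could do under the hood, assuming the model is a plain JSON file of document frequencies; the field names, the tokenization regex, and the fit helper itself are placeholders, not a settled format:

```ruby
require "json"

# Hypothetical fit step: count document frequencies across the corpus,
# prune by --min-df / --max-df, and persist the result as JSON.
def fit(paths, model_path: "./keywords.json", min_df: 1, max_df: 1.0)
  doc_freq = Hash.new(0)
  docs = 0

  paths.each do |path|
    terms = File.read(path).downcase.scan(/[a-z0-9']+/).uniq
    terms.each { |t| doc_freq[t] += 1 }
    docs += 1
  end

  # Drop terms outside the requested document-frequency bounds.
  doc_freq.select! { |_term, df| df >= min_df && df.fdiv(docs) <= max_df }

  model = {
    "documents" => docs,
    "doc_freq"  => doc_freq,
    "min_df"    => min_df,
    "max_df"    => max_df
  }
  File.write(model_path, JSON.pretty_generate(model))
end
```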
Transform (extract terms)
# Get weighted terms from text
keywords "Ruby is a programming language"
# => ruby:0.52 programming:0.41 language:0.38
# From stdin
echo "some document text" | keywords
# => document:0.45 text:0.42
# Top N terms only
keywords -n 5 "long document with many terms..."
# => term1:0.5 term2:0.4 term3:0.3 term4:0.2 term5:0.1
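Transform could then be little more than tf × idf against that model. This sketch assumes the JSON layout above and a smoothed idf; the exact weighting and normalization the tool would ship with are open questions:

```ruby
require "json"

# Hypothetical transform step: weight the input's terms by tf * idf
# and print them as term:weight pairs, highest first.
def extract_keywords(text, model_path: "./keywords.json", top: nil)
  model = JSON.parse(File.read(model_path))
  docs, doc_freq = model["documents"], model["doc_freq"]

  tf = Hash.new(0)
  text.downcase.scan(/[a-z0-9']+/).each { |t| tf[t] += 1 }

  weighted = tf.filter_map do |term, count|
    df = doc_freq[term]
    next unless df # skip out-of-vocabulary terms
    [term, count * (Math.log((1.0 + docs) / (1.0 + df)) + 1.0)]
  end

  weighted.sort_by { |_t, w| -w }
          .first(top || weighted.size)
          .map { |term, w| format("%s:%.2f", term, w) }
          .join(" ")
end

puts extract_keywords("Ruby is a programming language", top: 5)
```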
Extract (convenience alias)
# Extract keywords from a file
keywords extract article.txt
# => machine:0.61 learning:0.58 neural:0.45 network:0.42
# From URL (with curl)
curl -s https://example.com/article | keywords extract
Info
keywords info
# => Documents: 1,234
# => Vocabulary: 5,678
# => Min DF: 1
# => Max DF: 1.0
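info would then just read the persisted model back; a tiny sketch against the same assumed JSON layout:

```ruby
require "json"

# Print the stats of the fitted model (assumed field names).
model = JSON.parse(File.read("./keywords.json"))
puts "Documents:  #{model['documents']}"
puts "Vocabulary: #{model['doc_freq'].size}"
puts "Min DF:     #{model['min_df']}"
puts "Max DF:     #{model['max_df']}"
```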
Options
-m, --model FILE Model file (default: ./keywords.json)
-n, --top N Show top N terms only
-q Quiet mode (for scripting)
-v, --version Show version
-h, --help Show help
Fit-specific options
--min-df N Minimum document frequency (default: 1)
--max-df N Maximum document frequency ratio (default: 1.0)
--ngram MIN,MAX N-gram range (default: 1,1)
Examples
# Build vocabulary and extract keywords
keywords fit articles/*.txt
keywords "What are the main topics?"
# => topics:0.6 main:0.4
# Pipeline with classifier
keywords fit corpus.txt
keywords extract article.txt | head -5 # top 5 terms
# Compare documents (output as TSV for scripting)
keywords -q doc1.txt > /tmp/v1.txt
keywords -q doc2.txt > /tmp/v2.txt
# Then use an external tool for cosine similarity
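For the comparison case, a short Ruby one-off can compute cosine similarity from the two dumps, assuming -q writes one term and weight per line separated by a tab (that exact output format is still up for discussion):

```ruby
# cosine.rb -- cosine similarity of two `keywords -q` dumps
def load_vector(path)
  File.readlines(path).to_h do |line|
    term, weight = line.split("\t")
    [term, weight.to_f]
  end
end

a = load_vector("/tmp/v1.txt")
b = load_vector("/tmp/v2.txt")

dot  = a.sum { |term, w| w * b.fetch(term, 0.0) }
norm = ->(vec) { Math.sqrt(vec.values.sum { |w| w * w }) }
puts dot / (norm.call(a) * norm.call(b))
```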
Design Principles
- Separate tool: TF-IDF is not classification, don't pretend it is
- Transform is default: No subcommand needed for primary action
- Stdin works: Pipe-friendly
- Scriptable output: -q for machine-readable format
Implementation Notes
- Use optparse (stdlib)
- Reuse existing Classifier::TFIDF class (when implemented)
- Exit codes: 0 success, 1 error, 2 usage error
- Default model: ./keywords.json
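A skeleton of the executable along those lines, using only stdlib optparse; the handle dispatcher is a placeholder for where the Classifier::TFIDF calls would eventually go:

```ruby
#!/usr/bin/env ruby
require "optparse"

# Placeholder dispatcher: the real implementation would delegate to
# Classifier::TFIDF (once it exists) for fit / transform / info.
def handle(command, args, options)
  puts "#{command}: args=#{args.inspect} options=#{options.inspect}"
end

options = { model: "./keywords.json", top: nil, quiet: false }

parser = OptionParser.new do |o|
  o.banner = "Usage: keywords [fit|extract|info] [options] [FILES|TEXT]"
  o.on("-m", "--model FILE", "Model file (default: ./keywords.json)") { |v| options[:model] = v }
  o.on("-n", "--top N", Integer, "Show top N terms only") { |v| options[:top] = v }
  o.on("-q", "Quiet mode (for scripting)") { options[:quiet] = true }
  o.on("-h", "--help", "Show help") { puts o; exit 0 }
end

begin
  parser.parse!
rescue OptionParser::ParseError => e
  warn "keywords: #{e.message}"
  exit 2 # usage error
end

# Transform is the default action; fit/extract/info are subcommands.
command = %w[fit extract info].include?(ARGV.first) ? ARGV.shift : "transform"

begin
  handle(command, ARGV, options)
  exit 0
rescue StandardError => e
  warn "keywords: #{e.message}"
  exit 1 # runtime failure
end
```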
Related
- Add CLI executable for training and classification #119 - classifier CLI (separate tool, same principles)