Coradoc is a hub-and-spoke document transformation library for Ruby. It provides a canonical CoreModel that serves as the transformation hub, enabling seamless conversions between AsciiDoc, HTML, Markdown, DOCX, and other formats.

- Hub-and-Spoke Architecture: CoreModel as the canonical representation
- Format Conversion: convert between any supported formats
- Developer API: simple, intuitive Ruby API
- Command-Line Interface: CLI for quick conversions
- Extensibility: add new formats with minimal code
- Query API: CSS-like selectors for document querying
- Validation Framework: schema-based document validation
- Streaming Processor: process large documents efficiently
- Lazy Evaluation: memory-efficient lazy document processing

Coradoc uses a hub-and-spoke architecture where all format transformations go through a canonical CoreModel:

AsciiDoc (.adoc) ──┐
Markdown (.md)   ──┼──► ToCore ──► CoreModel Hub ──► FromCore ──┬──► HTML Render ──► HTML (.html)
HTML (.html)     ──┤                                            ├──► Markdown (.md)
DOCX (.docx)     ──┘  (via Uniword)                             └──► AsciiDoc (.adoc)

This architecture means:

- Adding a new format requires only two transformers (ToCoreModel and FromCoreModel)
- N formats can interoperate with just 2N transformers (not N*(N-1))
- The CoreModel provides a canonical, well-defined structure

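The transformer-count argument can be illustrated with a toy hub in plain Ruby. This is a minimal sketch and does not use Coradoc itself: `ToyHub`, its `register` and `convert` methods, and the hash-based hub are hypothetical stand-ins for the CoreModel and its transformers, and the parsing is deliberately naive.

```ruby
# Toy hub-and-spoke converter (illustrative only, not Coradoc's implementation).
# The "hub" is a plain Hash of the form { title: String, body: String }.
class ToyHub
  def initialize
    @to_hub = {}    # format => lambda(text) -> hub hash
    @from_hub = {}  # format => lambda(hub hash) -> text
  end

  # Each format contributes exactly two transformers.
  def register(format, to_hub:, from_hub:)
    @to_hub[format] = to_hub
    @from_hub[format] = from_hub
  end

  # Every conversion goes source -> hub -> target.
  def convert(text, from:, to:)
    @from_hub.fetch(to).call(@to_hub.fetch(from).call(text))
  end

  def transformer_count
    @to_hub.size + @from_hub.size
  end
end

# Number of direct converters needed when every format talks to every other.
def pairwise_converters(n)
  n * (n - 1)
end

hub = ToyHub.new

hub.register(:markdown,
  to_hub: ->(text) {
    title, body = text.split("\n\n", 2)
    { title: title.sub(/\A#\s*/, ""), body: body.to_s }
  },
  from_hub: ->(doc) { "# #{doc[:title]}\n\n#{doc[:body]}" })

hub.register(:asciidoc,
  to_hub: ->(text) {
    title, body = text.split("\n\n", 2)
    { title: title.sub(/\A=\s*/, ""), body: body.to_s }
  },
  from_hub: ->(doc) { "= #{doc[:title]}\n\n#{doc[:body]}" })

hub.register(:html,
  to_hub: ->(text) {
    { title: text[%r{<h1>(.*?)</h1>}m, 1].to_s,
      body: text[%r{<p>(.*?)</p>}m, 1].to_s }
  },
  from_hub: ->(doc) { "<h1>#{doc[:title]}</h1>\n<p>#{doc[:body]}</p>" })

puts hub.convert("# Title\n\nParagraph", from: :markdown, to: :html)

[3, 5, 10].each do |n|
  puts "N=#{n}: hub-and-spoke=#{2 * n}, pairwise=#{pairwise_converters(n)}"
end
```

With three formats both approaches need six transformers; the hub layout pulls ahead from four formats onward (8 versus 12), and the gap widens as formats are added.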
The CoreModel (Coradoc::CoreModel) is the canonical representation of documents:

StructuralElement | Document structure (document, section)
Block             | Content blocks (paragraph, code, quote)
ListBlock         | Lists (ordered, unordered, definition)
InlineElement     | Inline formatting (bold, italic, link)
Table             | Tables with rows and cells
Image             | Images with alt text

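To make the tree shape concrete, the sketch below builds a small document and walks it depth-first. It is an assumption-laden stand-in: `Element` is a hypothetical `Struct`, not one of the `Coradoc::CoreModel` classes, and `collect_elements` is a helper invented for this illustration. Only the attribute names (`element_type`, `title`, `level`, `content`, `children`) are taken from the examples in this README.

```ruby
# Stand-in for CoreModel nodes (hypothetical; the real classes live in
# Coradoc::CoreModel and may differ).
Element = Struct.new(:element_type, :title, :level, :content, :children,
                     keyword_init: true)

# Depth-first walk that returns every node in document order.
def collect_elements(node, acc = [])
  acc << node
  (node.children || []).each { |child| collect_elements(child, acc) }
  acc
end

doc = Element.new(
  element_type: "document",
  title: "My Document",
  children: [
    Element.new(element_type: "section", level: 1, title: "Intro", children: [
      Element.new(element_type: "paragraph", content: "Hello")
    ]),
    Element.new(element_type: "section", level: 1, title: "Usage", children: [
      Element.new(element_type: "paragraph", content: "World")
    ])
  ]
)

section_titles = collect_elements(doc)
  .select { |el| el.element_type == "section" }
  .map(&:title)

puts section_titles.inspect # prints ["Intro", "Usage"]
```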
Add this line to your application’s Gemfile:
gem 'coradoc'
# For DOCX support, also add:
gem 'coradoc-docx'
gem 'uniword' # DOCX reader

And then execute:

bundle install

Or install it yourself as:

gem install coradoc

require 'coradoc'
require 'coradoc/html'
require 'coradoc/markdown'
# Convert Markdown to HTML
html = Coradoc.convert("# Title\n\nParagraph", from: :markdown, to: :html)
# Parse to CoreModel
core = Coradoc.parse("# Title\n\nParagraph", format: :markdown)
# Serialize to any format
markdown = Coradoc.serialize(core, to: :markdown)
html = Coradoc.serialize(core, to: :html)

Convert Word documents to AsciiDoc or Markdown:

require 'coradoc'
require 'coradoc/docx'
# Convert DOCX to AsciiDoc
adoc = Coradoc.convert("input.docx", from: :docx, to: :asciidoc)
# Convert DOCX to Markdown
md = Coradoc.convert("input.docx", from: :docx, to: :markdown)
# Parse DOCX to CoreModel for manipulation
core = Coradoc.parse("input.docx", format: :docx)
core.title # => "Document Title"
core.children # => Array of sections, paragraphs, tables, etc.
# Serialize to different formats
adoc = Coradoc.serialize(core, to: :asciidoc)
md = Coradoc.serialize(core, to: :markdown)

The Coradoc.convert method handles the complete transformation pipeline:

# AsciiDoc to HTML
html = Coradoc.convert(adoc_text, from: :asciidoc, to: :html)
# Markdown to HTML
html = Coradoc.convert(md_text, from: :markdown, to: :html)
# HTML to Markdown
md = Coradoc.convert(html_text, from: :html, to: :markdown)
# DOCX to AsciiDoc (requires coradoc-docx gem)
adoc = Coradoc.convert("document.docx", from: :docx, to: :asciidoc)
# DOCX to Markdown
md = Coradoc.convert("document.docx", from: :docx, to: :markdown)

Parse documents to CoreModel for manipulation:

# Parse Markdown
core = Coradoc.parse("# Title\n\nContent", format: :markdown)
# Access the structure
core.element_type # => "document"
core.title # => "Title"
core.children # => Array of child elements

Serialize CoreModel to any supported format:

# Create or modify CoreModel
core = Coradoc::CoreModel::StructuralElement.new(
  element_type: "document",
  title: "My Document",
  children: [...]
)
# Serialize to HTML
html = Coradoc.serialize(core, to: :html)
# Serialize to Markdown
md = Coradoc.serialize(core, to: :markdown)

The coradoc command-line tool provides quick conversions:

# Basic conversion
coradoc convert input.md -o output.html
# Specify formats explicitly
coradoc convert input.md --from markdown --to html
# Convert DOCX to AsciiDoc (requires coradoc-docx gem)
coradoc convert document.docx -o output.adoc
# Convert DOCX to Markdown
coradoc convert document.docx -o output.md
# Use different HTML themes
coradoc convert input.md -o output.html --theme modern
# Verbose output
coradoc convert input.md -o output.html --verbose
# Show supported formats
coradoc formats

To add a new format, create a gem with:

- Format module with parse/serialize methods
- ToCoreModel transformer: converts the native model to CoreModel
- FromCoreModel transformer: converts CoreModel to the native model
- Registration with Coradoc

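As a rough picture of what the registration piece involves, the sketch below keeps a name-to-module map and an extension-to-name map. It is illustrative only: `ToyRegistry`, `format_for`, `fetch`, and `UpcaseFormat` are hypothetical names, and this README does not show how Coradoc's own registry is implemented. Only the shape of the `register_format(name, mod, extensions:)` call mirrors the README.

```ruby
# Toy format registry (illustrative; not Coradoc's registry implementation).
class ToyRegistry
  def initialize
    @formats = {}     # name => format module
    @extensions = {}  # ".ext" => name
  end

  # Mirrors the shape of the register_format call used in this README.
  def register_format(name, mod, extensions: [])
    @formats[name] = mod
    extensions.each { |ext| @extensions[ext.downcase] = name }
  end

  # Guess a format name from a file name; nil if the extension is unknown.
  def format_for(path)
    @extensions[File.extname(path).downcase]
  end

  def fetch(name)
    @formats.fetch(name) { raise ArgumentError, "unknown format: #{name}" }
  end
end

# A trivial format module exposing a parse/serialize pair.
module UpcaseFormat
  def self.parse_to_core(content)
    { element_type: "document", content: content }
  end

  def self.serialize(core, **_options)
    core[:content].upcase
  end
end

registry = ToyRegistry.new
registry.register_format(:upcase, UpcaseFormat, extensions: [".up", ".upcase"])

name = registry.format_for("notes.UP")          # extension lookup is case-insensitive
core = registry.fetch(name).parse_to_core("hello")
puts registry.fetch(name).serialize(core)       # prints HELLO
```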
# lib/coradoc/my_format.rb
module Coradoc
  module MyFormat
    # Parse input to native model
    def self.parse(content)
      # ...
    end

    # Parse directly to CoreModel
    def self.parse_to_core(content)
      Transform::ToCoreModel.transform(parse(content))
    end

    # Transform native model to CoreModel
    def self.to_core(model)
      Transform::ToCoreModel.transform(model)
    end

    # Transform CoreModel to native model
    def self.from_core(core)
      Transform::FromCoreModel.transform(core)
    end

    # Serialize CoreModel to output
    # (serialize_native is not defined in this snippet; supply it for your format)
    def self.serialize(core, **options)
      model = from_core(core)
      serialize_native(model)
    end
  end
end

# Register the format
Coradoc.register_format(:my_format, Coradoc::MyFormat,
                        extensions: ['.myf', '.myformat'])

# Document
doc = Coradoc::CoreModel::StructuralElement.new(
  element_type: "document",
  title: "My Document",
  children: [...]
)
# Section
section = Coradoc::CoreModel::StructuralElement.new(
  element_type: "section",
  level: 1,
  title: "Section Title",
  children: [...]
)

# Paragraph
para = Coradoc::CoreModel::Block.new(
  element_type: "paragraph",
  content: "Paragraph text"
)
# Code block
code = Coradoc::CoreModel::Block.new(
  element_type: "block",
  delimiter_type: "----",
  content: "def hello; puts 'world'; end",
  language: "ruby"
)

# Unordered list
list = Coradoc::CoreModel::ListBlock.new(
  marker_type: "unordered",
  items: [
    Coradoc::CoreModel::ListItem.new(content: "Item 1", marker: "*"),
    Coradoc::CoreModel::ListItem.new(content: "Item 2", marker: "*"),
  ]
)
# Definition list
def_list = Coradoc::CoreModel::DefinitionList.new(
  items: [
    Coradoc::CoreModel::DefinitionItem.new(
      term: "API",
      definitions: ["Application Programming Interface"]
    ),
  ]
)

# Bold
bold = Coradoc::CoreModel::InlineElement.new(
  format_type: "bold",
  content: "bold text"
)
# Link
link = Coradoc::CoreModel::InlineElement.new(
  format_type: "link",
  target: "https://example.com",
  content: "Example"
)
# STEM formula
stem = Coradoc::CoreModel::InlineElement.new(
  format_type: "stem",
  content: "E = mc^2",
  stem_type: "stem"
)

Supported inline format types:

bold        | Bold text
italic      | Italic/emphasized text
monospace   | Code/monospace text
link        | Hyperlinks
xref        | Cross-references
stem        | STEM formulas (mathematical notation)
footnote    | Footnotes
term        | Term references (glossary terms)
superscript | Superscript text
subscript   | Subscript text

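One way to picture how a serializer might dispatch on `format_type` is a lookup table of renderers. The sketch below is hypothetical throughout: `Inline` is a `Struct` stand-in for `Coradoc::CoreModel::InlineElement`, and `MARKDOWN_INLINE` and `render_inline` are names invented here. The Markdown output chosen for each type is an assumption, not Coradoc's actual serializer behavior.

```ruby
# Stand-in for an inline node (hypothetical; the real class is
# Coradoc::CoreModel::InlineElement and may behave differently).
Inline = Struct.new(:format_type, :content, :target, keyword_init: true)

# Markdown-flavoured renderers keyed by format_type. Types without an entry
# (for example "xref", "stem" or "footnote") fall back to plain content.
MARKDOWN_INLINE = {
  "bold"        => ->(el) { "**#{el.content}**" },
  "italic"      => ->(el) { "*#{el.content}*" },
  "monospace"   => ->(el) { "`#{el.content}`" },
  "link"        => ->(el) { "[#{el.content}](#{el.target})" },
  "superscript" => ->(el) { "<sup>#{el.content}</sup>" },
  "subscript"   => ->(el) { "<sub>#{el.content}</sub>" }
}.freeze

# Render one inline element, falling back to its raw content.
def render_inline(el)
  MARKDOWN_INLINE.fetch(el.format_type, ->(e) { e.content.to_s }).call(el)
end

bold = Inline.new(format_type: "bold", content: "bold text")
link = Inline.new(format_type: "link", content: "Example",
                  target: "https://example.com")
stem = Inline.new(format_type: "stem", content: "E = mc^2")

puts render_inline(bold) # prints **bold text**
puts render_inline(link) # prints [Example](https://example.com)
puts render_inline(stem) # prints E = mc^2 (fallback)
```

A table-driven dispatch like this keeps each format type independent, so supporting a new type means adding one entry instead of extending a conditional.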
Query documents using CSS-like selectors:
# Parse document
doc = Coradoc.parse(adoc_text, format: :asciidoc)
# Find all sections
sections = doc.query('section')
# Find level-2 sections
doc.query('section.level-2').each do |section|
  puts section.title
end
# Find paragraphs with specific role
examples = doc.query('[role=example]')
# Complex selectors with pseudo-classes
doc.query('section > paragraph:first-child')
# Query within a specific element
doc.query_within(section, 'paragraph')
# Chain queries
doc.query('section').filter('.important').first

Validate documents against schemas:

# Define a validation schema
schema = Coradoc::Validation::Schema.define do
  required :title, type: String, min_length: 1
  required :sections, type: Array, min_count: 1
  optional :author, type: String
  rule :check_references do |doc|
    refs = doc.query('xref')
    missing = refs.reject { |r| doc.resolve_reference(r) }
    missing.map { |r| "Unresolved reference: #{r.target}" }
  end
end
# Validate a document
result = schema.validate(document)
if result.valid?
  puts "Document is valid"
else
  result.errors.each { |e| puts "#{e.path}: #{e.message}" }
end

Process large documents without loading everything into memory:

# Stream parse large file
Coradoc::Streaming.parse_large_file("large.adoc", format: :asciidoc,
                                    chunk_size: 100) do |chunk|
  chunk.each { |element| process_element(element) }
end
# Transform in chunks
results = Coradoc::Streaming.transform_in_chunks(elements, chunk_size: 50) do |chunk|
  chunk.map { |el| transform_element(el) }
end
# Incremental serialization
File.open("output.html", "w") do |file|
  Coradoc::Streaming.serialize_incremental(document, format: :html) do |chunk|
    file.write(chunk)
  end
end
# Process with memory constraints
progress = Coradoc::Streaming.process_with_memory_limit(
  "input.adoc", "output.html",
  format: :asciidoc, output_format: :html,
  max_memory: 50 * 1024 * 1024 # 50MB
)
puts progress.to_s # "100 processed (100.0%) at 10.0/sec ~0.5min remaining"

Memory-efficient processing using lazy enumerators and on-demand evaluation:

# Wrap document for lazy iteration
wrapper = Coradoc::Lazy.wrap(document)
wrapper.each_section do |section|
  process_section(section) # Processed on-demand
end
# Lazy transformation pipeline
result = Coradoc::Lazy.transform(sections) do |p|
  p.map { |s| transform_section(s) }
   .select { |s| s.visible? }
   .take(10)
end.to_a # Only evaluates when to_a is called
# Process in batches
wrapper.each_batch(10) do |batch|
  batch.each { |section| process(section) }
end
# Lazy reference resolution
resolver = Coradoc::Lazy.resolver(document, loader: ->(ref, _) {
  load_include_file(ref)
})
content = resolver.resolve("include::chapter1.adoc[]")

# Run all tests
bundle exec rspec
# Run specific test file
bundle exec rspec spec/coradoc/developer_experience_spec.rb
# Run with documentation
bundle exec rspec --format documentation

coradoc/
├── lib/
│   └── coradoc/
│       ├── coradoc.rb      # Main API (parse, convert, serialize)
│       ├── registry.rb     # Format registry
│       ├── core_model/     # CoreModel classes
│       ├── transform/      # Base transformer
│       ├── query.rb        # Document query API
│       ├── validation.rb   # Document validation
│       ├── streaming.rb    # Large document processing
│       ├── hooks.rb        # Plugin lifecycle hooks
│       ├── extensions.rb   # Custom element extensions
│       └── cli.rb          # CLI implementation
├── coradoc-adoc/           # AsciiDoc format gem
├── coradoc-docx/           # DOCX format gem (OOXML → CoreModel via Uniword)
├── coradoc-html/           # HTML format gem
├── coradoc-markdown/       # Markdown format gem
├── spec/                   # Test files
└── exe/
    └── coradoc             # CLI executable

- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -am 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request

Copyright 2024-2026 Ribose Inc.
Licensed under the Apache License, Version 2.0.