Conversation

@toschoosqd (Contributor)

First iteration for metadata with some preliminary ideas.
Highly relevant is #3

  • Updated metadata documentation with corrections and clarifications.
  • Updated terminology and corrected size units in metadata documentation. Added sections on conversions and statistics, and improved clarity in various explanations.
  • Expanded the metadata document to include detailed discussions on metadata purpose, goals, and formats, along with specific short-term and long-term objectives.
  • Revised the metadata document to improve clarity and consistency in language, including updates to the purpose, schema definitions, and type system descriptions.
  • Corrected spelling of 'modelling' to 'modeling' throughout the document.
  • Added comment to clarify hash type options.
  • Updated metadata document to improve clarity and consistency in terminology, including changes to key definitions and type representations.
  • Corrected a typo in the metadata documentation regarding chunk summary.
  • Added comment regarding key range sharding and chunk handling.
@dzhelezov (Contributor)

One of the key architectural decisions to be made is whether we put the dataset properties (that is, the properties of the data itself) into the metadata, or leave it purely schema-oriented. Introducing the statistics already suggests that we want to be data-aware here. Then we should also think about where we store:

  • the data location
  • the data update history (git-like?)

A previous attempt to design such a location-aware and update/edit-aware metadata file was made in this issue: https://github.com/subsquid/datas3ts/issues/5

There the design was built around the following properties:

  • the schema is immutable and the immutable part of the metadata should be self-certified
  • any dataset snapshot can be identified by a single hash (so that it can be published on-chain or elsewhere, and it will self-certify the full data in there, similar to how Merkle trees work)
  • the locations of the dataset files are part of the dataset description but can be updated at any time, so that the workers may download the necessary chunks from multiple locations (e.g. IPFS, S3, etc.)
  • data appends are efficient and don't require a full recalculation of the dataset hashes, only incremental updates (see the sketch at the end of this comment)

Some extra thought and care should be taken in order not to trigger expensive list operations -- similar to how we currently avoid them by placing the files in a tree-like directory structure in S3 buckets.
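
As a minimal sketch of the incremental-hash idea, assuming a flat list of chunks and plain SHA-256; the function names and the chunk layout below are illustrative, not taken from the spec:

```python
import hashlib

def chunk_digest(chunk_files: dict) -> str:
    """Hash the immutable content of a single chunk (file name -> bytes)."""
    h = hashlib.sha256()
    for name in sorted(chunk_files):
        h.update(name.encode())
        h.update(hashlib.sha256(chunk_files[name]).digest())
    return h.hexdigest()

def append_chunk(dataset_hash: str, new_chunk_digest: str) -> str:
    """Fold a new chunk digest into the running dataset hash.

    Appending is O(1): only the previous dataset hash and the new chunk
    digest are needed, never the full history of chunks.
    """
    return hashlib.sha256((dataset_hash + new_chunk_digest).encode()).hexdigest()

# Start from an empty dataset and append two chunks.
dataset_hash = hashlib.sha256(b"genesis").hexdigest()
for chunk in [{"blocks.parquet": b"..."}, {"blocks.parquet": b",,,"}]:
    dataset_hash = append_chunk(dataset_hash, chunk_digest(chunk))
print(dataset_hash)  # a single hash that self-certifies the whole snapshot
```

File locations would stay outside the hashed part, so they can be updated freely without invalidating the snapshot hash.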

@define-null left a comment


I did a first pass on this draft and left a bunch of comments and questions.

- a **modeling facility** to define generic data in the first place.

We are currently working with known and - to some extent - homogeneous data.
Generic interoperability is significantly harder.


I'd like to better understand the implications of generic interoperability and generic data here.

  • Are we talking about raw data for such generic datasets, or about structured data?
  • Do we intend to guarantee atomicity and isolation of the ingestion?


- Data Generation

- Integration with standard tools


Can you elaborate on which types of tools?

- Hand-crafted ingestion pipelines

- Validation and Parsing of data for different components
(portals, workers, SDKs, the DuckDB extension).


One of the important questions that is not clear to me from the draft is whether the intent is to implement a schema-on-read or a schema-on-write architecture. In the former case we are talking about traditional data lakes with raw or semi-structured data, rather limited correctness checks, and the schema applied when running the query (fewer integrity constraints, enforced on read). In the latter case we are aiming for stricter correctness (integrity constraints enforced on write) and consistency.

From that perspective I'm not sure whether https://github.com/subsquid/specs/pull/3/changes is assumed in this document or not.

Comment on lines +101 to +102
In the future, we may add statistics to columns or row groups
to accelerate ingestion and, in particular, retrieval.


If we aim for analytical use cases, statistics will be essential: the common pattern in such systems is to use pruning techniques to reduce the subset of data involved in query execution. So in my view we should prioritize column and row-group statistics such as min/max, cardinality, null counts, etc. from the start.
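
To make the pruning argument concrete, here is a minimal sketch of min/max chunk statistics driving chunk selection at query time; the metadata layout and field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ColumnStats:
    min_value: int
    max_value: int
    null_count: int

@dataclass
class ChunkMeta:
    path: str
    stats: dict  # column name -> ColumnStats

def prune(chunks, column, lo, hi):
    """Keep only chunks whose [min, max] range can overlap the query range."""
    return [
        c for c in chunks
        if c.stats[column].max_value >= lo and c.stats[column].min_value <= hi
    ]

chunks = [
    ChunkMeta("chunk-000", {"block_number": ColumnStats(0, 999_999, 0)}),
    ChunkMeta("chunk-001", {"block_number": ColumnStats(1_000_000, 1_999_999, 0)}),
]
# A query for blocks 1_500_000..1_600_000 only has to touch the second chunk.
print([c.path for c in prune(chunks, "block_number", 1_500_000, 1_600_000)])
```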


Types are distinguished into **primitive types** and **complex types**.

Primitive types are defined in terms of


When it comes to types, in my view it's important to consider several factors:

  • what is the minimum subset of types that we need for a POC?
  • what is the compatibility story with other existing dbs and engines?
  • what are the conversion rules that we would like to have for those types?
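
To ground those questions, a hypothetical minimal primitive-type set for a POC, together with a widening-only conversion rule, could look like the following; the type names and the rules are illustrative, not taken from the draft:

```python
# Hypothetical minimal primitive type set, chosen to map directly
# onto common Arrow/Parquet physical types.
PRIMITIVE_TYPES = {"bool", "int32", "int64", "float64", "string", "binary", "timestamp_ms"}

# Allowed implicit conversions: only lossless widenings.
WIDENINGS = {
    ("int32", "int64"),
    ("int32", "float64"),
    ("string", "binary"),  # UTF-8 encoding is lossless
}

def can_convert(src: str, dst: str) -> bool:
    return src == dst or (src, dst) in WIDENINGS

assert can_convert("int32", "int64")
assert not can_convert("float64", "int64")  # narrowing must stay explicit
```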


In other words, chunks remain tied to ingestion time, whereas keys may not.
The natural partitioning therefore maps **key ranges to ingestion time ranges**.
Key-range shards should be defined by users: they know their data size, ingestion speed and data skew.


Perhaps we can go with a hybrid strategy, where the sorting criteria are provided by the user while sharding happens automatically? Users commonly understand the shape of their data better, but they might have less visibility into what the efficient sharding strategies would be.
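
A minimal sketch of that hybrid approach, assuming the user supplies only the sort key and a target shard size while the boundaries are derived from the observed key distribution (all names below are hypothetical):

```python
def auto_shard(keys, target_rows_per_shard):
    """Derive key-range shard boundaries from the data itself.

    Equal row counts per shard are guaranteed even for skewed keys;
    only the key ranges end up uneven.
    """
    keys = sorted(keys)
    shards = []
    for start in range(0, len(keys), target_rows_per_shard):
        block = keys[start:start + target_rows_per_shard]
        shards.append((block[0], block[-1]))
    return shards

# Skewed keys (quadratically growing) still yield equally sized shards.
print(auto_shard([i * i for i in range(1_000)], 250))
```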


- B+Tree.

### Real-Time Data


I'm not quite sure I understand what real-time ingestion means in this context. Could you elaborate, with an example, on the user use-case we are considering here? I would expect batch ingestion, which is preferable for efficiency and thus not real-time.

Comment on lines +145 to +148
Schemas shall also include elements for defining real-time data.
This may include an endpoint from which data is read,
and a stored procedure (or equivalent processing step)
that transforms data and passes it on to an internal API.


Just to confirm that I understand you correctly: are we talking about an ETL pipeline here, with the possibility to specify the transformation part?
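
If that is the intent, the real-time part of a schema might reduce to a source, a transform, and a sink, as in this hypothetical sketch (the field names and the endpoint are made up for illustration):

```python
# Hypothetical real-time section of a dataset schema: an endpoint to read
# from, plus a transformation step applied before data reaches the
# internal ingestion API.
realtime = {
    "source": {"kind": "websocket", "url": "wss://example.org/feed"},
    "transform": "normalize_block",  # name of a stored procedure / UDF
    "sink": "internal://ingest",
}

def normalize_block(raw: dict) -> dict:
    """Example transform: rename fields and drop anything not in the schema."""
    return {"number": raw["blockNumber"], "hash": raw["hash"]}

print(normalize_block({"blockNumber": 42, "hash": "0xabc", "extra": "dropped"}))
```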

In the future, we may add statistics to columns or row groups
to accelerate ingestion and, in particular, retrieval.

Integrity Constraints are


It would be great to be a bit more specific here about whether it is suggested to enforce those constraints or not, and if yes, at which phase (read/write). For example, enforcing an FK constraint would add significant ingestion overhead and complexity. Similarly, a uniqueness constraint is not typically enforced in OLAP systems, to my knowledge.
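
One way to make this explicit would be to attach an enforcement mode to every declared constraint, so the spec states per constraint whether it is checked on write, on read, or merely documented; a hypothetical sketch:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Constraint:
    kind: Literal["primary_key", "foreign_key", "unique", "not_null"]
    columns: tuple
    # "write"  -> checked during ingestion (expensive, strong guarantees)
    # "read"   -> checked lazily at query time, if at all
    # "assert" -> documented only, never enforced (typical OLAP choice)
    enforcement: Literal["write", "read", "assert"]

constraints = [
    Constraint("primary_key", ("block_number",), enforcement="write"),
    Constraint("foreign_key", ("parent_hash",), enforcement="assert"),
    Constraint("unique", ("tx_hash",), enforcement="read"),
]
```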

(maps, assignments, chunks, etc.) in memory in a single portal. We will need
to **shard datasets across portals**.

We may also want to explore other kinds of indices, for example:


I would suggest we prioritize bitmap and zone indexes. B+trees and radix trees are a better fit for OLTP workloads.
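
For illustration, a bitmap index over a low-cardinality column is cheap in both code and space; this sketch uses Python ints as bitsets and purely hypothetical data:

```python
# Minimal bitmap-index sketch for a low-cardinality column ("status").
rows = ["ok", "ok", "err", "ok", "err", "ok"]

bitmaps = {}
for i, value in enumerate(rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)

def matching_rows(value):
    """Return the row positions whose bit is set in the value's bitmap."""
    bitmap = bitmaps.get(value, 0)
    return [i for i in range(len(rows)) if bitmap >> i & 1]

print(matching_rows("err"))  # -> [2, 4]
```

A zone index would essentially be the per-chunk min/max statistics discussed above, used the same way for range pruning.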
