Feature: User-Controlled STAC Collection IDs
Background
Currently, STAC collection IDs are auto-assigned by DpsStacItemGenerator using a
deterministic formula derived from the DPS job's .met.json metadata file:
{username}__{algorithm_name}__{algorithm_version}__{tag}
This value is slugified (special characters replaced) and then unconditionally written
into the collection field of every STAC item before publishing to the ingestor queue —
regardless of what collection the user's catalog.json specifies. Users have requested
the ability to control the collection ID so that outputs from related jobs and algorithm
runs can be organized into a single, meaningfully named collection.
This ticket proposes an initial implementation using admin-mediated collection creation,
designed to extend toward self-service and algorithm-level authorization.
How the current pipeline works
DpsStacItemGenerator (link) is triggered by S3 event notifications when a DPS job writes a
catalog.json to the output bucket. For each event:
- The DPS output prefix is extracted from the S3 key path using a timestamp pattern
- A .met.json file is loaded from that prefix — this is the authoritative source of
job context, containing at minimum: username, algorithm_name, algorithm_version, and tag
- A deterministic collection ID is constructed from those fields and slugified
- The catalog.json is read via pystac; every item's collection field is overwritten
with the deterministic ID before publishing to the ingestor SNS topic
Some users are already setting the collection field in their STAC items, but the
current code silently overwrites it. This feature stops that overwrite and makes the
item-provided collection ID the primary routing mechanism.
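For reference, the fallback construction can be sketched as follows. The exact slugify rules are not documented here, so the substitution below (lowercase, with runs of characters outside [a-z0-9_-] replaced by a hyphen) is an assumption, not the generator's actual code:

```python
import re


def deterministic_collection_id(met: dict) -> str:
    """Build the fallback collection ID from .met.json fields.

    Assumed slugify: lowercase, then replace any run of characters
    outside [a-z0-9_-] with a single hyphen.
    """
    raw = "__".join(
        met[field]
        for field in ("username", "algorithm_name", "algorithm_version", "tag")
    )
    return re.sub(r"[^a-z0-9_-]+", "-", raw.lower())
```

Note that under these assumed rules the double-underscore delimiters survive slugification, which is what makes the backfill parsing described later possible.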
Proposed Approach
Phase 1: Admin-Mediated Collection Creation (this ticket)
Users request a named collection via an out-of-band process (GitHub issue or intake
form). An admin creates the collection in pgSTAC using a new CLI tool, setting the
requesting user as owner. Users then set the collection field in their algorithm's
STAC item outputs; DpsStacItemGenerator will respect it if authorization passes.
Metadata storage design
There are two distinct categories of collection metadata introduced by this feature,
and they warrant different storage strategies:
Access control (who may write, which algorithms are approved) is governance data.
It should not appear in public STAC API responses and has no place in the STAC spec.
This belongs in pgstac's private JSONB column, which is purpose-built for backend
metadata that is never surfaced to API consumers:
```jsonc
// pgstac.collections.private — never returned by the STAC API
{
  "owner": "jsmith",
  "contributors": ["kwilliams"],
  "approved_algorithms": [
    {"name": "my-flood-detector", "version": "1.2.0"},
    {"name": "my-flood-detector", "version": "1.3.0"}
  ]
}
```
Algorithm provenance (which DPS algorithms have contributed data to this
collection) is catalog metadata. It is legitimately useful to anyone browsing the
catalog and belongs in the public STAC collection document. Rather than overloading the
standard providers field — which is designed for data lineage attribution, not
pipeline tracking — this is a natural fit for a MAAP DPS STAC extension:
```jsonc
// pgstac.collections.content — returned by the STAC API (fragment)
"stac_extensions": [
  "https://maap-project.org/stac/extensions/dps/v1.0.0/schema.json"
],
"maap_dps:contributing_algorithms": [
  {"name": "my-flood-detector", "version": "1.2.0"},
  {"name": "my-flood-detector", "version": "1.3.0"}
]
```
This separation keeps providers clean for its intended provenance purpose, keeps
access control private, and gives MAAP DPS pipeline metadata a coherent versioned home
that can grow (e.g., DPS environment info, platform version) without polluting the
base STAC document. The extension schema would be maintained in this repository.
Note that approved_algorithms (private, access control) and
maap_dps:contributing_algorithms (public, provenance) are related but distinct: the
former defines what is permitted, the latter records what has actually run. They are
updated at different times and by different actors.
Authorization logic in DpsStacItemGenerator
When a catalog event is received, the Lambda reads job context from .met.json as
today, then resolves the target collection ID from the STAC items themselves before
applying authorization:
- .met.json fields available: username, algorithm_name, algorithm_version, tag
- item["collection"] field: user-specified collection ID (optional)
If items do not specify a collection field (or specify one that is absent from
pgSTAC):
→ Fall back to the existing deterministic ID formula. Emit a structured warning to
CloudWatch if a collection was specified but not found in pgstac.
If the catalog specifies a collection ID that exists in pgstac:
| Check | Source | Pass | Fail |
| --- | --- | --- | --- |
| username is owner or contributor | private column | Continue | Hard failure — do not fall back silently |
| algorithm_name + algorithm_version approved (if list non-empty) | private column | Continue | Hard failure — emit structured error |
| Both pass | — | Publish items with user-specified collection ID; update maap_dps:contributing_algorithms | — |
The distinction between "collection not found" (soft fallback) and "collection found
but unauthorized" (hard failure) is intentional. A missing collection plausibly means
the user hasn't requested creation yet. An authorization failure means an explicit
policy was violated and silently redirecting items elsewhere would be a governance
failure.
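The decision logic above can be expressed as a small pure function. This is a sketch, not the Lambda's actual code: it assumes the private column shape shown earlier and the open-by-default reading of an empty approved_algorithms list, which is still an open question (question 3 below):

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    FALLBACK = "fallback_to_deterministic_id"   # collection not found: soft fallback
    PUBLISH = "publish_to_user_collection"       # all checks passed
    HARD_FAIL = "hard_failure"                   # explicit policy violated


def authorize(private: Optional[dict], username: str,
              algorithm_name: str, algorithm_version: str) -> Outcome:
    # private is None when the collection does not exist in pgstac:
    # the user may simply not have requested creation yet.
    if private is None:
        return Outcome.FALLBACK
    # Ownership check — hard failure, never a silent redirect.
    if username != private.get("owner") and username not in private.get("contributors", []):
        return Outcome.HARD_FAIL
    # Algorithm approval only applies when the list is non-empty
    # (open-by-default; see open question 3).
    approved = private.get("approved_algorithms", [])
    if approved and {"name": algorithm_name, "version": algorithm_version} not in approved:
        return Outcome.HARD_FAIL
    return Outcome.PUBLISH
```

Keeping this as a pure function over the private record makes the soft-fallback vs. hard-failure distinction unit-testable independently of the Lambda's event handling.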
Database query
Authorization is resolved by querying pgstac directly via the existing pgBouncer
connection (VPC-accessible EC2), avoiding HTTP overhead through the STAC API:
```sql
SELECT private
FROM pgstac.collections
WHERE id = $1
```
This is a single query per catalog event regardless of item count. pgBouncer in
transaction pooling mode handles concurrent Lambda invocations efficiently. On
successful authorization, a second update appends the contributing algorithm to the
public content if not already present.
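The follow-up update could be expressed directly in SQL. This is a sketch under the content shape assumed above, not a tested statement: $1 is the collection ID and $2 the contributing-algorithm object serialized as jsonb.

```sql
-- Sketch: append the contributing algorithm to the public document,
-- but only when it is not already recorded.
UPDATE pgstac.collections
SET content = jsonb_set(
        content,
        '{maap_dps:contributing_algorithms}',
        coalesce(content->'maap_dps:contributing_algorithms', '[]'::jsonb)
            || $2::jsonb
    )
WHERE id = $1
  AND NOT coalesce(content->'maap_dps:contributing_algorithms', '[]'::jsonb)
          @> $2::jsonb;
```

The containment guard in the WHERE clause makes the update idempotent, so replayed S3 events do not produce duplicate provenance entries.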
Admin CLI
A lightweight wrapper around pypgstac that:
- Validates the proposed collection ID against naming rules and the reserved-name blocklist
- Writes the private column with owner, contributors, and an optional approved algorithm list
- Initializes the maap_dps extension fields in the public collection document
This is the highest-leverage piece of initial work. Without it, the catalog will
accumulate collections with inconsistent ownership and algorithm approval records.
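One possible shape for the tool's interface, with illustrative (not final) flag names; build_parser and private_record are hypothetical helpers, and the actual pypgstac write is omitted:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI surface; flag names are illustrative, not final.
    p = argparse.ArgumentParser(prog="maap-collection-admin")
    p.add_argument("collection_id")
    p.add_argument("--owner", required=True)
    p.add_argument("--contributor", action="append", default=[],
                   dest="contributors", help="may be repeated")
    p.add_argument("--approve", action="append", default=[],
                   metavar="NAME:VERSION",
                   help="approved algorithm, e.g. my-flood-detector:1.2.0")
    return p


def private_record(args: argparse.Namespace) -> dict:
    """Build the private-column governance record from parsed CLI args."""
    return {
        "owner": args.owner,
        "contributors": args.contributors,
        "approved_algorithms": [
            {"name": name, "version": version}
            for name, version in (spec.split(":", 1) for spec in args.approve)
        ],
    }
```

A real implementation would run the naming-rule validation before writing, then hand both the private record and the initialized maap_dps fields to pypgstac.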
Backfill
Existing auto-assigned collections will have private ownership records backfilled
using the username already embedded in their deterministic IDs. The algorithm_name
and algorithm_version components are also present in the deterministic ID, so
approved_algorithms and maap_dps:contributing_algorithms can be backfilled as well.
This allows the authorization check in DpsStacItemGenerator to go live without a
legacy carve-out for existing collections.
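The recovery step can be sketched as a split on the double-underscore delimiter. Note that the recovered components are the slugified forms (e.g. a version like 1-2-0 rather than 1.2.0), and the sketch assumes components never contain a double underscore themselves — open question 9 covers whether that holds for all existing IDs:

```python
from typing import Optional


def parse_deterministic_id(collection_id: str) -> Optional[dict]:
    """Recover ownership and algorithm fields from an auto-assigned ID.

    Returns None when the ID does not match the four-part
    {username}__{algorithm_name}__{algorithm_version}__{tag} shape.
    Recovered values are slugified forms, not the original metadata.
    """
    parts = collection_id.split("__")
    if len(parts) != 4:
        return None  # not a recognizable deterministic ID
    username, algorithm_name, algorithm_version, tag = parts
    return {
        "owner": username,
        "approved_algorithms": [
            {"name": algorithm_name, "version": algorithm_version}
        ],
        "tag": tag,
    }
```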
Phase 2: Self-Service Collection Creation (future)
The Phase 1 design is structured so that self-service collection creation slots in
without changing the DpsStacItemGenerator authorization logic. The Lambda already
handles all authorization outcomes. Future work will add a collection request UI or
API, synchronous ID reservation for race condition handling, and algorithm approval
management for collection owners.
Algorithm Authoring Convention
The collection ID should be treated as a runtime parameter, not a hardcoded value
inside algorithm code. DPS supports arbitrary named input parameters, and algorithms
should declare a collection_id input parameter that is passed through to the STAC
item outputs at job runtime:
```python
# Recommended pattern in algorithm code
from typing import Optional

def run(collection_id: Optional[str] = None, **kwargs):
    items = generate_stac_items(...)  # algorithm-specific STAC item generation
    for item in items:
        if collection_id:
            # Route items to the user-specified collection; when absent,
            # the deterministic fallback ID is applied downstream.
            item.collection_id = collection_id
    write_catalog(items)
```
When a user submits a DPS job, they can then pass their target collection ID as a job
input parameter without modifying the algorithm itself:
```yaml
algorithm: my-flood-detector v1.2.0
inputs:
  collection_id: jsmith--flood-catalog-2025
  ...
```
This convention should be documented as a best practice in the MAAP algorithm
authoring guide. Its benefits are:
- The same algorithm version can route to different collections (dev, staging,
production; personal vs. shared)
- Collection governance decisions are separated from algorithm logic — the algorithm
doesn't need to know or care about catalog organization
- Users who don't specify a collection_id parameter get the deterministic fallback
behavior automatically, so the convention is opt-in and backward compatible
Algorithms that hardcode a collection ID in their output items will still work — the
authorization check applies regardless of how the collection ID got into the item —
but hardcoding is discouraged because it couples a specific catalog governance decision
to algorithm code that may be shared or reused by others.
Naming Rules
Enforced by the admin CLI now, by the self-service API later:
- Lowercase alphanumeric characters, hyphens, and underscores only
- 3–64 characters; no leading or trailing hyphens or underscores
- Case-insensitive uniqueness (my-collection and My-Collection are the same)
- Reserved names blocked: api, admin, system, search, conformance, queryables,
and any existing system collection patterns
Collection IDs are immutable after creation. The current deterministic ID formula
uses __ (double underscore) as a delimiter — user-specified IDs should avoid this
pattern to remain visually distinguishable from auto-assigned IDs.
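These rules are mechanical enough to capture in a single validator shared by the admin CLI now and the self-service API later. A sketch, treating the double-underscore guideline as an enforced rule (the text above only says "should avoid", so that choice is an assumption):

```python
import re

RESERVED = {"api", "admin", "system", "search", "conformance", "queryables"}

# 3-64 chars, lowercase alphanumerics/hyphens/underscores,
# no leading or trailing hyphen or underscore.
_ID_RE = re.compile(r"^[a-z0-9][a-z0-9_-]{1,62}[a-z0-9]$")


def validate_collection_id(candidate: str, existing_ids: set) -> list:
    """Return a list of rule violations; an empty list means valid."""
    errors = []
    if not _ID_RE.fullmatch(candidate):
        errors.append("must be 3-64 lowercase alphanumerics, hyphens, or "
                      "underscores, with no leading/trailing hyphen or underscore")
    if candidate.lower() in RESERVED:
        errors.append("reserved name")
    if "__" in candidate:
        errors.append("double underscore is reserved for auto-assigned IDs")
    if candidate.lower() in {e.lower() for e in existing_ids}:
        errors.append("already taken (case-insensitive)")
    return errors
```

Returning all violations at once, rather than failing on the first, gives the requesting user a complete picture in one round trip.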
Error Surfacing
DpsStacItemGenerator currently has no feedback channel back to the user after a DPS
job completes. Collection governance introduces new async failure modes — collection not
found, user not authorized, algorithm version not approved — that users need visibility
into. At minimum, the Lambda will emit structured CloudWatch log events for every
governance decision. A user-facing feedback mechanism (DPS job callback or ingestion
status dashboard) is a dependency that should be resolved before this feature ships.
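A minimal sketch of such a structured event; the field names are illustrative, but a fixed "event" key makes filtering in CloudWatch Logs Insights straightforward:

```python
import json
import logging

logger = logging.getLogger("DpsStacItemGenerator")


def log_governance_decision(decision: str, collection_id: str,
                            met: dict, reason: str) -> str:
    """Emit one structured JSON log line per governance decision."""
    event = {
        "event": "collection_governance",
        "decision": decision,  # e.g. "publish", "fallback", "hard_failure"
        "collection_id": collection_id,
        "username": met.get("username"),
        "algorithm_name": met.get("algorithm_name"),
        "algorithm_version": met.get("algorithm_version"),
        "reason": reason,
    }
    line = json.dumps(event)
    logger.warning(line)
    return line
```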
Key Questions to Resolve Before Proceeding
1. What does tag represent in the deterministic ID formula, and how should it be
handled in the fallback path?
The current format is {username}__{algorithm_name}__{algorithm_version}__{tag}. Usage of tag varies across users: some use it to group related jobs, others as a unique per-job identifier. The existing deterministic scheme works well when tag is a grouping field, but when it is used as a unique identifier it fragments outputs into one collection per job.
2. Multi-collection catalogs: authorize per item or require uniformity?
Because the collection ID is read from each STAC item, a single catalog could
theoretically contain items targeting different collections. Define whether this is
supported (each unique collection ID is authorized separately within one job) or
rejected (all items in a catalog must target the same collection ID). Requiring
uniformity is simpler to reason about and likely sufficient for current use cases.
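Under the uniformity interpretation, the check is only a few lines. A sketch treating items as plain dicts (the real code would operate on pystac objects):

```python
from typing import List, Optional


def resolve_target_collection(items: List[dict]) -> Optional[str]:
    """Enforce the 'uniform catalog' interpretation: all items must agree.

    Returns the single user-specified collection ID, or None when no item
    sets one (deterministic fallback applies); raises ValueError for a
    mixed-collection catalog.
    """
    ids = {item.get("collection") for item in items}
    ids.discard(None)  # items without a collection fall back downstream
    if len(ids) > 1:
        raise ValueError(f"catalog targets multiple collections: {sorted(ids)}")
    return ids.pop() if ids else None
```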
3. What is the policy for an empty or absent approved_algorithms list in private?
Two reasonable interpretations: (a) no list means any algorithm is permitted for
authorized users (open by default, better for research flexibility), or (b) no list
means no algorithm is approved until explicitly configured (closed by default, better
for production data quality control). This should be decided as a platform-wide default
and documented clearly.
4. Can algorithm approval be version-wildcarded?
Should the approved list support {"name": "my-detector", "version": "*"} to approve
all versions of an algorithm, or must each version be explicitly listed? Explicit
versioning is stricter and better for production collections; wildcards are more
convenient for active development workflows. These could coexist as separate
authorization tiers.
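If wildcards were adopted, the membership test in the authorization path would generalize only slightly. A sketch of the wildcard-aware check:

```python
from typing import List


def algorithm_approved(approved: List[dict], name: str, version: str) -> bool:
    """Wildcard-aware approval check (option sketch for question 4).

    An entry with "version": "*" approves every version of that algorithm;
    otherwise the version must match exactly.
    """
    return any(
        entry["name"] == name and entry["version"] in ("*", version)
        for entry in approved
    )
```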
5. Who can manage the approved_algorithms list — owner only, or contributors too?
Recommend owner only, to prevent contributors from approving their own algorithm
versions without the collection owner's sign-off. This defines the permission boundary
for the algorithm approval management path built in Phase 2.
6. Should the "unauthorized" outcome fall back to the deterministic ID or hard-fail?
Confirmed preference in prior discussion was hard failure, but this should be validated
with stakeholders since it is a behavioral change for existing users who might
inadvertently specify a collection ID they don't own.
7. Namespace convention: prefixed or flat?
Should user-specified IDs use a prefix convention (e.g., jsmith--flood-catalog) to
prevent naming conflicts, or rely on a flat namespace with uniqueness enforcement? A
prefix is easy to enforce in the admin CLI and eliminates conflicts entirely.
8. MAAP DPS extension scope: what else belongs in it?
Algorithm name and version are the obvious starting point for maap_dps:contributing_algorithms.
Should the extension also capture other DPS job context — platform version, compute
environment, job ID — at the collection level? Defining the extension scope now avoids
schema churn later. A companion item-level extension (recording per-item job provenance)
may also be worth scoping alongside the collection-level one.
9. Backfill scope and feasibility.
How many existing collections need private ownership records and maap_dps extension
fields added? Are algorithm_name and algorithm_version reliably recoverable from all
existing deterministic IDs via the __ delimiter pattern?
Work Breakdown (Phase 1)
- Resolve open questions, at minimum: tag semantics and its role (if any) in the
deterministic fallback ID, and whether multi-collection catalogs (items targeting
different collections in a single job) are supported or rejected
- Define the MAAP DPS STAC extension (schema for maap_dps:contributing_algorithms and
scope of additional fields); publish the schema document in this repository
- Build the admin CLI (naming-rule validation, private column initialization, and
maap_dps extension field initialization)
- Backfill private ownership records and maap_dps extension fields on existing
collections
- Update DpsStacItemGenerator to: read the user-specified collection ID from items,
query the private column for ownership and the approved algorithm list, apply the
fallback/hard-failure authorization logic, and update maap_dps:contributing_algorithms
in content on successful ingestion