diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 38a845e81..d6049a887 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -17,7 +17,7 @@ - under the License. --> -Recommendations and requirements for how to best contribute to Parquet. We strive to obey these as best as possible. As always, thanks for contributing--we hope these guidelines make it easier and shed some light on our approach and processes. +Recommendations and requirements for how to best contribute to Parquet. We strive to obey these as best as possible. As always, thanks for contributing--we hope these guidelines make it easier and shed some light on our approach and processes. If you believe there should be a change or exception to these rules, please bring it up for discussion on the developer mailing list (dev@parquet.apache.org). ### Key branches - `master` has the latest stable changes @@ -29,3 +29,174 @@ Recommendations and requirements for how to best contribute to Parquet. We striv ### License By contributing your code, you agree to license your contribution under the terms of the APLv2: https://github.com/apache/parquet-format/blob/master/LICENSE + +### Additions/Changes to the Format + +Note: This section applies to actual functional changes to the specification. +Fixing typos, grammar, and clarifying concepts that would not change the +semantics of the specification can be done as long as a committer feels comfortable +to merge them. When in doubt, starting a discussion on the dev mailing list is +encouraged. + +The general steps for adding features to the format are as follows: + +1. Design/scoping: The goal of this phase is to identify design goals of a + feature and provide some demonstration that the feature meets those goals. + This phase starts with a discussion of changes on the developer mailing list + (dev@parquet.apache.org). Depending on the scope and goals of the feature, + it can be useful to provide additional artifacts as part of a discussion. 
The + artifacts can include a design document, a draft pull request to make the + discussion concrete and/or a prototype implementation to demonstrate the + viability of implementation. This step is complete when there is lazy + consensus. Part of the consensus is whether it is sufficient to provide two + working implementations as outlined in step 2, or if demonstration of the + feature with a downstream query engine is necessary to justify the feature + (e.g. demonstrate performance improvements in the Apache Arrow C++ Dataset + library, the Apache DataFusion query engine, or any other open source + engine). + +2. Completeness: The goal of this phase is to ensure the feature is viable and that + there is no ambiguity in its specification, by demonstrating compatibility + between implementations. Once a change has lazy consensus, two + implementations of the feature demonstrating interoperability must also be + provided. One implementation MUST be + [`parquet-java`](http://github.com/apache/parquet-java). It is preferred + that the second implementation be + [`parquet-cpp`](https://github.com/apache/arrow) or + [`parquet-rs`](https://github.com/apache/arrow-rs); however, at the discretion + of the PMC any open source Parquet implementation may be acceptable. + Implementations whose contributors actively participate in the community + (e.g. keep their feature matrix up-to-date on the Parquet website) are more + likely to be considered. If discussed as a requirement in step 1 above, + demonstration of integration with a query engine is also required for this + step. The implementations must be made available publicly, and they should be + fit for inclusion (for example, they were submitted as a pull request against + the target repository and committers gave positive reviews). Reports on the + benefits from closed source implementations are welcome and can help lend + weight to a feature's desirability but are not sufficient for acceptance of a + new feature. 
+ +Unless otherwise discussed, it is expected the implementations will be developed +from their respective main branches (i.e. backporting is not required), to +demonstrate that the feature is mergeable to its implementation. + +3. Ratification: After the first two steps are complete a formal vote is held on + dev@parquet.apache.org to officially ratify the feature. After the vote + passes the format change is merged into the `parquet-format` repository and + it is expected the changes from step 2 will also be merged soon after + (implementations should not be merged until the addition has been merged to + `parquet-format`). + +#### General guidelines/preferences on additions. + +1. To the greatest extent possible changes should have an option for forward + compatibility (old readers can still read files). The [compatibility and + feature enablement](#compatibility-and-feature-enablement) section below + provides more details on expectations for changes that break compatibility. + +2. New encodings should be fully specified in this repository and not + rely on external dependencies for implementation (i.e. `parquet-format` is + the source of truth for the encoding). If an encoding does require an + external dependency, then the external dependency must have its + own specification separate from implementation. + +3. New compression mechanisms should have a pure Java implementation that can be + used as a dependency in `parquet-java`; exceptions may be + discussed on the mailing list to see if a non-native Java + implementation is acceptable. + +### Releases + +The Parquet PMC aims to do releases of the format package only as needed when +new features are introduced. If multiple new features are being proposed +simultaneously some features might be consolidated into the same release. +Guidance is provided below on when implementations should enable features added +to the specification. 
Due to confusion in the past over Parquet versioning it +is not expected that there will be a 3.x release of the specification in the +foreseeable future. + +### Compatibility and Feature Enablement + +For the purposes of this discussion we classify features into the following buckets: + +1. Backward compatible. A file written under an older version of the format + should be readable under a newer version of the format. + +2. Forward compatible. A file written under a newer version of the format with + the feature enabled can be read under an older version of the format, but + some metadata might be missing or performance might be suboptimal. Simply + phrased, forward compatible means all data can be read back in an older + version of the format. New logical types are considered forward + compatible despite the loss of semantic meaning. + +3. Forward incompatible. A file written under a newer version of the format with + the feature enabled cannot be read under an older version of the format (e.g. + adding and using a new compression algorithm). It is expected any feature in + this category will provide a signal to older readers, so they can + unambiguously determine that they cannot properly read the file (e.g. via + adding a new value to an existing enum). + +New features are intended to be widely beneficial to users of Parquet, and +therefore it is hoped third-party implementations will adopt them quickly after +they are introduced. It is assumed that writing new parts of the format, and +especially forward incompatible features, will be configured with a feature flag +defaulted to "off", and at some future point the feature is turned on by default +(reading of the new feature will typically be enabled without configuration or +defaulted to on). Some amount of lead time is desirable to ensure a critical +mass of Parquet implementations support a feature to avoid compatibility issues +across the ecosystem. 
Therefore, the Parquet PMC gives the following +recommendations for managing features: + +1. Backward compatibility is the concern of implementations but given the + ubiquity of Parquet and the length of time it has been used, libraries should + support reading older versions of the format to the greatest extent possible. + +2. Forward compatible features/changes may be enabled and used by default in + implementations once the parquet-format containing those changes has been + formally released. For features that may pose a significant performance + regression to older format readers, libraries should consider delaying default + enablement until 1 year after the release of the parquet-java implementation + that contains the feature implementation. + +3. Forward incompatible features/changes should not be turned on by default + until 2 years after the parquet-java implementation containing the feature is + released. It is recommended that changing the default value for a forward + incompatible feature flag should be clearly advertised to consumers (e.g. via + a major version release if using Semantic Versioning, or highlighted in + release notes). + +For forward compatible changes which have a high chance of performance +regression for older readers and forward incompatible changes, implementations +should clearly document the compatibility issues. Additionally, while it is up +to maintainers of individual open-source implementations to make the best decision to serve +their ecosystem, they are encouraged to start enabling features by default along +the same timelines as `parquet-java`. Parquet-java will wait to enable features +by default until the most conservative timelines outlined above have been +exceeded. This timeline is an attempt to balance ensuring +new features make their way into the ecosystem and avoiding +breaking compatibility for readers that are slower to adopt new standards. 
We +encourage earlier adoption of new features when an organization using Parquet +can guarantee that all readers of the parquet files they produce can read a new +feature. + +After turning a feature on by default implementations +are encouraged to keep a configuration to turn off the feature. +A recommendation for full deprecation will be made in a future +iteration of this document. + +For features released prior to October 2024, target dates for each of these +categories will be updated as part of the `parquet-java 2.0` release process +based on a collected feature compatibility matrix. + +For each release of `parquet-java` or `parquet-format` that influences this +guidance it is expected exact dates will be added to parquet-format to provide +clarity to implementors (e.g. When `parquet-java` 2.X.X is released, any new +format features it uses will be updated with concrete dates). As part of +`parquet-format` releases the compatibility matrix will be updated to contain +the release date in the format. Implementations are also encouraged to provide +implementation date/release version information when updating the feature +matrix. + +End users of software are generally encouraged to consult the feature matrix +and vendor documentation before enabling features that are not yet widely +adopted.