BURST is software and a zip-based archive format that offers an optimized integration between Amazon S3 and the BTRFS Linux filesystem. It is probably the fastest way to load large numbers of files onto an EC2 instance from S3.
BURST can also be used without BTRFS, with different but generally good performance characteristics.
Download the latest release for your platform from GitHub Releases.
# Example for Linux x86_64 (replace v1.0.0 with the latest version)
VERSION=v1.0.0
curl -LO https://github.com/posit-dev/burst/releases/download/${VERSION}/burst-${VERSION}-linux-x86_64.tar.gz
tar -xzf burst-${VERSION}-linux-x86_64.tar.gz
sudo mv burst-${VERSION}-linux-x86_64/burst-writer burst-${VERSION}-linux-x86_64/burst-downloader /usr/local/bin/
Verify the download (optional):
curl -LO https://github.com/posit-dev/burst/releases/download/${VERSION}/checksums.txt
sha256sum -c checksums.txt --ignore-missing
To use BURST, first create a BURST archive of the files you wish to save in S3:
burst-writer -o name-of-archive.zip /path/to/directory
This will create an archive file containing all the files and folders under /path/to/directory.
Direct upload to S3 is not currently implemented; you'll then need to upload this file to S3 using another tool.
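For example, the archive could be uploaded with the AWS CLI (the bucket and key placeholders below match the downloader example later in this document; any S3-capable upload tool works):

```shell
# Hypothetical upload step using the AWS CLI; substitute your own bucket and
# key names. Fall back to a stub for offline illustration when a real upload
# is not possible:
{ command -v aws >/dev/null 2>&1 && [ -f name-of-archive.zip ]; } \
  || aws() { echo "would run: aws $*"; }

aws s3 cp name-of-archive.zip s3://name-of-bucket/archive-name-in-S3
```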
One of BURST's optimization strategies involves direct (ioctl) interaction with the BTRFS filesystem. This interaction requires
that the downloader run as root (or in any other way that grants CAP_SYS_ADMIN). BURST also preserves file ownership, which
requires these permissions as well.
sudo ./burst-downloader -b name-of-bucket -k archive-name-in-S3 -r aws-region -o /path/to/restore/to
This will download the archive from S3 and recreate the data at /path/to/restore/to.
It is also possible to run the downloader without elevated permissions. In this mode, the data must be decompressed
immediately as it is downloaded and written to disk using conventional write() calls. This approach has higher disk-throughput
requirements, higher CPU utilization, and lower disk-space efficiency.
It is a design priority that data captured into BURST formatted S3 objects can be recovered in future using generally available tools, without necessarily relying on BURST software or running on systems BURST supports.
BURST archives are compliant with the ZIP specification, and zip extractors unaffiliated with the BURST project exist that are capable of extracting BURST zip archives -- albeit without the performance optimizations. Specifically, the zip writer logic is rigorously tested against 7-Zip. Any archives generated by BURST that 7-Zip cannot extract correctly would be considered a bug. More extractors may be added to the test matrix in future.
Note that many popular zip extractor implementations cannot process BURST archives because they do not support the Zstandard compression algorithm that BURST uses.
At a high level, the most notable difference between a BURST zip archive and a regular ZIP archive is that certain binary structures are guaranteed to occur at every 8 MiB boundary of the overall object.
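As a sketch of what that alignment means in practice, the bytes at each 8 MiB offset can be inspected with standard tools. A synthetic file stands in for a real BURST archive below, and the ZIP local-file-header magic (`PK\003\004`) placed at the boundaries is an illustrative assumption, not a statement of what BURST actually emits there:

```shell
# Inspect the 4 bytes at each 8 MiB boundary of a file. The synthetic file and
# the magic bytes written into it are assumptions for illustration only.
MIB8=$((8 * 1024 * 1024))
ARCHIVE=synthetic.bin

printf 'PK\003\004' > "$ARCHIVE"                          # magic at offset 0
dd if=/dev/zero bs=$((MIB8 - 4)) count=1 >> "$ARCHIVE" 2>/dev/null
printf 'PK\003\004' >> "$ARCHIVE"                         # magic at 8 MiB

SIZE=$(($(wc -c < "$ARCHIVE")))
offset=0
while [ "$offset" -lt "$SIZE" ]; do
  magic=$(od -An -tx1 -j "$offset" -N 4 "$ARCHIVE" | tr -d ' \n')
  echo "offset ${offset}: ${magic}"
  offset=$((offset + MIB8))
done
```

For this synthetic file, the loop prints `offset 0: 504b0304` and `offset 8388608: 504b0304`.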
You should not run the BURST downloader on archives that could have been created by untrusted writers, especially when running it as root. There are several reasons for this:
- BURST preserves file ownership, and so would enable an untrusted source to craft a zip stream that creates executable files owned by privileged users or even potentially overwrites files critical to system integrity.
- At present, no fuzzing of the downloader against intentionally malicious zip streams has been done. It is probable that security vulnerabilities exist when it is presented with intentionally malformed input.
The downloader is currently capable of restoring a dataset comprising a comprehensive Ubuntu-based system of around 250,000 inodes spanning 8.7 GiB in 5.9 - 6.2 seconds on an i7ie.3xlarge EC2 instance.
More performance benchmarks to come.
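As a rough sanity check, those figures imply the following restore rates (illustrative arithmetic only, not a published benchmark):

```shell
# Implied restore rates from the figures above (~8.7 GiB and ~250,000 inodes
# in roughly 6 seconds); purely illustrative arithmetic.
awk 'BEGIN { printf "%.2f GiB/s\n", 8.7 / 6.0 }'
awk 'BEGIN { printf "%.0f inodes/s\n", 250000 / 6.0 }'
```

That is, roughly 1.45 GiB/s of restored data and about 41,700 files created per second.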
The fundamental techniques leveraged are:
- Adheres to published patterns for high-performance S3 utilization, including:
  - Many concurrent TCP streams utilizing byte-range fetches, with download requests typically aligned to the part boundaries used during multipart upload (this can be overridden).
  - Does not issue requests that yield small amounts of response data, e.g. for small objects. S3 offers high sequential read bandwidth but also high roundtrip latencies, so the ability to create many small files from a single S3 request greatly improves small-file restoration performance.
  - Internally utilizes aws-c-s3, a library that obsesses over maximizing the performance of S3 in the EC2 environment. This enables us to inherit many micro-optimizations such as DNS load balancing across the S3 fleet.
- After the zip central directory is fetched (typically one roundtrip), response data from all concurrent multipart downloads is immediately streamed to the final physical disk blocks where that data needs to reside. The downloader knows what part of what file any given byte-range download relates to. There is no need to download to a temporary location first or wait for other concurrent downloads to complete.
- Passthrough Zstandard compression, meaning the compression in S3 is the same as the compression on disk after restoration. BTRFS only decompresses on read. This is important because it reduces the total data that must be written out, which is usually the bottleneck: even direct-attached NVMe instance storage in EC2 typically has much lower write throughput than the instance's network bandwidth to S3. Leaving the data compressed also reduces CPU utilization during extraction and enables more total information to be placed on valuable instance block storage.
The resulting decompress-on-read does increase CPU overhead for reads that miss the page cache, but given Zstandard's exceptional decompression performance, the disk savings and improved extraction performance are probably desirable for most applications.
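The part-aligned byte-range strategy described in the list above can be sketched as simple range arithmetic. The 8 MiB part size and 20 MiB object size below are illustrative assumptions, not values taken from BURST:

```shell
# Emit HTTP Range headers aligned to multipart part boundaries for an object.
PART_SIZE=$((8 * 1024 * 1024))       # assumed part size used at upload time
OBJECT_SIZE=$((20 * 1024 * 1024))    # hypothetical object size

start=0
while [ "$start" -lt "$OBJECT_SIZE" ]; do
  end=$((start + PART_SIZE - 1))
  [ "$end" -ge "$OBJECT_SIZE" ] && end=$((OBJECT_SIZE - 1))
  echo "Range: bytes=${start}-${end}"
  start=$((start + PART_SIZE))
done
```

For this object size, the loop emits `bytes=0-8388607`, `bytes=8388608-16777215`, and `bytes=16777216-20971519`: every request except the last covers exactly one upload part.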
See the docs/ folder for an overview of the BURST format and principles of its efficient use.
sudo apt-get install -y ruby cmake libzstd-dev zlib1g-dev
mkdir build
cd build
cmake ..
make
To run the Zstandard compression tests, you need 7-Zip with Zstandard codec support.
p7zip-full (7-Zip 23.01+dfsg) strips Zstandard codec support for DFSG (Debian Free Software Guidelines) compliance. You must install the official 7-Zip from 7-zip.org instead.
sudo apt-get install -y unzip
# Via Makefile (recommended)
make test # All tests
make test-unit # Unit tests (< 5s)
make test-integration # Fast integration (~1min with 4 parallel jobs)
make test-slow # Slow E2E tests (~5min with 4 parallel jobs)
# Control parallelism
CTEST_PARALLEL_LEVEL=8 make test-integration # Use 8 parallel jobs
CTEST_PARALLEL_LEVEL=1 make test-integration # Disable parallelism
# Via CTest (from build directory)
ctest # Run all tests
ctest -V # Verbose output
ctest -R test_alignment # Run specific test
ctest -L slow --parallel 4 # Slow tests with 4 jobs