Steven N Hart edited this page Mar 8, 2016 · 13 revisions

We ask participants to follow some level of standardization to make this process a bit smoother for everyone.

Data

To keep the comparisons as fair as possible, every submission must use the same dataset. However, you may store it in whatever structure you find works best. In this challenge, we will be using a publicly available dataset: the 1000 Genomes Project. These files can be parsed and transformed into any format necessary for your database schema. All data must be loaded into your database, so no pre-filtering!

Special Note: the AD field in the VCF file will need to be parsed in one of the challenges, so make sure you pay attention to that! See more details about each challenge on the Home page.
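As a rough illustration of what parsing AD can involve, here is a minimal Python sketch. It assumes AD appears as a per-sample key in the FORMAT column (column 9) of a tab-separated VCF data line, holding comma-separated integer depths with `.` for missing values. The function name `parse_ad` and the sample names are hypothetical, and your own loader may need to handle the field differently.

```python
from typing import Dict, List, Optional


def parse_ad(vcf_line: str, sample_names: List[str]) -> Dict[str, Optional[List[Optional[int]]]]:
    """Return the AD values per sample for one VCF data line.

    Assumption: AD is listed in the FORMAT column and holds
    comma-separated integers, with "." marking a missing value.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")  # column 9 is FORMAT

    # If this record has no AD key, every sample gets None.
    if "AD" not in format_keys:
        return {name: None for name in sample_names}

    ad_index = format_keys.index("AD")
    result: Dict[str, Optional[List[Optional[int]]]] = {}
    for name, sample in zip(sample_names, fields[9:]):
        values = sample.split(":")
        # Trailing sample fields may be dropped, so check the length first.
        if ad_index >= len(values) or values[ad_index] == ".":
            result[name] = None
            continue
        result[name] = [
            None if depth == "." else int(depth)
            for depth in values[ad_index].split(",")
        ]
    return result
```

Keeping the depths as a list of integers, rather than the raw string, makes it easier to store them in an array or nested-document column that can be queried directly.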

Directory structure

  • Database Name
    • user/
      • README.md
        • This markdown document should explain how to run your code. Importantly, it must contain enough detail for someone else to replicate your results, including the database version you used and any instructions for sharding, indexing, etc.
      • SCHEMAs.md
        • This document should contain an example entry for each document or table in your database, so that users can easily understand your design at a glance.
      • scripts/
        • All scripts used from the initial curl or wget calls to database import and query will go here.
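Before submitting, it may help to check that your user directory matches the layout above. The following is only a sketch: the helper name `missing_entries` is hypothetical, and it checks for nothing beyond the three entries listed in this section.

```python
from pathlib import Path
from typing import List

# Entries expected directly under "<Database Name>/<user>/", per the layout above.
REQUIRED_FILES = ("README.md", "SCHEMAs.md")
REQUIRED_DIRS = ("scripts",)


def missing_entries(user_dir: str) -> List[str]:
    """Return the required files and directories that are absent from user_dir."""
    root = Path(user_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    return missing
```

An empty list means all three entries are present; it says nothing about whether their contents are sufficient to replicate your results.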

Note: 🐳 Docker containers are extremely helpful here: they avoid assumptions about dependencies and help ensure the toolkit will work in other people's hands!

Transparency

All code must be made publicly available. In particular, we will make the code available on this site so others can learn from your approach. No code, no contribution. Remember, this is a learning exercise and being open is a critical component.
