Context and motivation
Computational biology and computational biodiversity currently present a fractured landscape of methods and platforms that complicates the reproduction of scientific results. Even within the same lab, it is often difficult to take up a past study or to reuse parts of it in a new study. While electronic notebooks like Jupyter and platforms like Galaxy address part of this problem, it is hard to build hybrid environments in which such tools are only one component.
Historically, the tendency has been to build all-in-one platforms, including specialized web portals or generalist platforms like the above. These platforms incorporate predefined and preorchestrated services, such as databases, compute servers, and navigation interfaces. They offer extensible lists of tools. On most of them, users can extract workflows from the history of their activities and share them.
While such platforms have proven extraordinarily useful, in the field it is almost always true that part of the analysis has to be performed outside of the platform and therefore is not documented in the workflow. What is missing is a system by which users can define an ad hoc, custom work environment, deploy it where they want, record its state, and share and mix it with the environments of their colleagues.
The Alcyone paradigm
Alcyone instantiates bioinformatics environments from text specifications committed to a Git repository. The core goal is to bring to computational biology the same collaboration tools and the same rigor that have proven themselves in software development. Since the invention of Git and its lightweight branching workflows, software developers have adopted highly effective practices for working asynchronously and non-linearly. A Git commit provides a unique identifier for the state of a directory, and Git branches permit independent lines of work that can be merged into one state at a later date.
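The Git practices described above can be sketched with plain Git commands. This is only an illustration using standard Git, not Alcyone's own pre-configured operations. The branch name add-tool and the contents written to alcyone.yaml are placeholders invented for the example, and everything runs in a throwaway directory.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so no real repository is touched.
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Example User"
git config user.email "user@example.org"

# Record an initial specification (the contents are a placeholder,
# not real Alcyone syntax).
echo "# placeholder specification" > alcyone.yaml
git add alcyone.yaml
git commit -q -m "Initial environment"

# The commit hash identifies this exact state of the directory.
base=$(git rev-parse HEAD)
main_branch=$(git symbolic-ref --short HEAD)

# A colleague works independently on a branch (hypothetical name).
git checkout -q -b add-tool
echo "# placeholder: one more tool" >> alcyone.yaml
git commit -q -am "Add a tool"

# Later, the independent work is merged back into the main line.
git checkout -q "$main_branch"
git merge -q add-tool
merged=$(git rev-parse HEAD)
echo "base state:   $base"
echo "merged state: $merged"

# Any recorded state can be recovered from its identifier.
git checkout -q "$base" -- alcyone.yaml
```

After the last command, alcyone.yaml in the working directory again holds the content recorded in the first commit, while the merged state remains available under its own identifier.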
Alcyone defines a file structure for specifying bioinformatics analysis environments, including tool choice, interoperability, and sources of raw data. These specifications are recorded in a Git repository. Alcyone compiles a specification into a master Docker container that deploys and orchestrates containers for each of the component tools. Alcyone can restore any version of an environment recorded in the Git repository.
Alcyone conceives the user’s computing environment as a microservices architecture, where each bioinformatics tool in the specification is a separate containerized Docker service. Alcyone builds a master container for the specified environment that is responsible for building, updating, deploying and stopping these containers, as well as recording and sharing the environment in a Git repository. The master container can be manipulated using a command-line interface.
Alcyone is under active development and has not been formally released. The project can be found at https://gitlab.inria.fr/alcyone/alcyone. To request access to the project, please contact David Sherman.
Typical use of Alcyone is to extract a command script:
docker run --rm alcyone/alcyone > alcyone.sh
This script is then used to instantiate the environment defined in the alcyone.yaml specification in the current directory, mounting the data directories inside the appropriate containers.
When Alcyone environments are stopped, each service records its configuration into files in the directory.
Recording your environment
Your Alcyone environment is specified by the contents of the directory, and is recorded using Git. The Alcyone script includes pre-configured Git operations as a convenience.
You should not check high-volume data into your Git repository, and should exclude them using a .gitignore specification. If you want to record your data in your Git commits, we recommend using git-lfs.
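As a minimal sketch of this advice, the following commands exclude a data directory through .gitignore and, if git-lfs happens to be installed, track one kind of large file with it instead. The directory name data/ and the *.bam pattern are assumptions chosen for illustration, and the example runs in a throwaway repository.

```shell
#!/bin/sh
set -eu

# Throwaway repository for the demonstration.
workdir=$(mktemp -d)
cd "$workdir"
git init -q

# Assumed layout: bulky raw data lives under data/ (hypothetical name).
cat > .gitignore <<'EOF'
data/
EOF

mkdir data
: > data/reads.fastq.gz

# git check-ignore prints the path and exits 0 when a rule excludes it.
git check-ignore data/reads.fastq.gz

# Alternative for data that should be versioned: git-lfs, if available.
# The *.bam pattern is only an example.
if git lfs version >/dev/null 2>&1; then
    git lfs install --local
    git lfs track "*.bam"   # writes the rule to .gitattributes
fi
```

Files matched by .gitignore stay out of commits entirely, whereas files tracked by git-lfs are recorded in commits as small pointers while their contents are stored separately.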