If you are running a High Content Screening pipeline you probably have a lot of moving pieces:

- Trigger CellProfiler analyses, either from a LIMS system, by watching a filesystem, or through some other process.
- Keep track of dependencies between CellProfiler analyses - first run an illumination correction and then your analysis.
- If you have a large dataset and you want it analyzed sometime this century, split your analysis, run the pieces, and then gather the results.
- Once you have results, decide on a method of organization: put your data in a database and set up in-depth analysis pipelines.

These tasks are much easier to accomplish when you have a system or framework that is built for scientific workflows. There are any number of scientific workflow managers out there, and by the time I finish this article a few more will have popped into existence. Here I will use Apache Airflow, "a platform created by the community to programmatically author, schedule and monitor workflows." It can integrate with existing systems or stand on its own.

There are a ton of great introductory resources out there on Apache Airflow, but I will very briefly go over it here. If you prefer to watch, I have a video where I go through all the steps in this tutorial.

Apache Airflow gives you a framework to organize your analyses into DAGs, or Directed Acyclic Graphs. A DAG is the bucket you throw your analysis in. If you aren't familiar with the term, it's really just a way of saying Step3 depends upon Step2, which depends upon Step1, or Step1 -> Step2 -> Step3.

Your DAG is composed of Operators and Sensors. Operators are an abstraction over the kind of task you are completing. These will often be Bash, Python, or SSH, but can also be even cooler things like Docker, Kubernetes, AWS Batch, AWS ECS, database operations, file pushers, and more. Then there are Sensors, which are nice and shiny ways of waiting on various operations, whether that is waiting for a file to appear, for a record to appear in a database, or for another task to complete.

Out of the box you get lots of niceness, including a nice web interface with a visual browser of your tasks, a scheduler, configurable parallelism, logging, watchers, and any number of executors. As all of your configuration is written in code, it is also extremely flexible.
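The Step1 -> Step2 -> Step3 dependency idea behind a DAG can be sketched with nothing but the Python standard library; the step names here are just placeholders for illustration:

```python
# A DAG is just tasks plus "depends on" edges. graphlib (Python 3.9+)
# can compute a valid execution order for us.
from graphlib import TopologicalSorter

# Map each step to the set of steps it depends on (hypothetical names).
dependencies = {
    "Step1": set(),
    "Step2": {"Step1"},
    "Step3": {"Step2"},
}

# static_order() yields the steps in an order that respects every edge.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # Step1 first, then Step2, then Step3
```

Acyclic is the important word: if Step1 also depended on Step3, no valid order would exist and the sorter would raise a `CycleError`.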
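To make the Sensor and Operator concepts concrete, here is a minimal sketch of what such a DAG file could look like, assuming Airflow 2.x. The task IDs, file paths, and shell commands are hypothetical placeholders, not this tutorial's actual pipeline:

```python
# Hypothetical sketch: a Sensor waits for a plate of images to land on
# disk, then an illumination correction runs before the main analysis
# (Step1 -> Step2 -> Step3).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="cellprofiler_example",   # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,          # triggered manually or externally
    catchup=False,
) as dag:
    # Step1: a Sensor that polls the filesystem until the plate appears.
    wait_for_plate = FileSensor(
        task_id="wait_for_plate",
        filepath="/data/incoming/plate_001.done",  # placeholder path
        poke_interval=60,                          # check every minute
    )

    # Step2: an Operator running the illumination correction.
    illumination_correction = BashOperator(
        task_id="illumination_correction",
        bash_command="cellprofiler -c -r -p /pipelines/illum.cppipe",  # placeholder
    )

    # Step3: the analysis itself, which depends on Step2.
    analysis = BashOperator(
        task_id="analysis",
        bash_command="cellprofiler -c -r -p /pipelines/analysis.cppipe",  # placeholder
    )

    # Express the dependency chain: Step1 -> Step2 -> Step3.
    wait_for_plate >> illumination_correction >> analysis
```

The `>>` operator is Airflow's shorthand for setting downstream dependencies; splitting a large dataset would typically fan this pattern out into many parallel analysis tasks that a final gather step depends on.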