.. _debusine-concepts:

=================
Debusine concepts
=================

Debusine has been designed to run a network of generic “workers” that can
perform various “tasks” producing “artifacts”. Interesting artifacts that
we want to keep in the long term are stored in “collections”.

While tasks can be scheduled as individual “work requests”, the power of
Debusine lies in its ability to combine multiple (different) tasks in
“workflows”, where each workflow has its own logic to orchestrate multiple
work requests across the available workers.

A Debusine instance can be multi-tenant, divided into “scopes” of users
and groups. These contain “workspaces” that have their own sets of
artifacts and collections. Workspaces can inherit from one another in
order to share collections and artifacts when required.

.. _explanation-artifacts:

Artifacts
=========

Artifacts are at the heart of Debusine. They combine:

* a set of files
* key-value data (stored as a JSON-encoded dictionary)
* a category

The category is just a string identifier used to recognize artifacts sharing
the same structure. You can create and use categories as you see fit, but we
have defined a basic :ref:`ontology <artifact-reference>` suited to the needs
of a Debian-based distribution.

Artifacts are both inputs (submitted by users) and outputs (generated by
tasks). They are created and stored in a workspace and can have an
expiration delay controlling their lifetime. Artifacts are (mostly)
immutable: they should never be modified after creation.

Files in artifacts are content-addressed (stored by hash) in the
database, so a single file can be referenced in multiple places without
unnecessary data duplication.

Files in artifacts have names that may include directories.
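
The following is a minimal, purely illustrative sketch of this structure in
plain Python; the class, field names and sample values are hypothetical and
only mirror the concepts above; they are not Debusine's actual data model or
API.

.. code-block:: python

   import hashlib
   from dataclasses import dataclass, field


   @dataclass
   class Artifact:
       """Hypothetical, simplified view of an artifact."""

       category: str  # e.g. "debian:source-package"
       data: dict  # key-value data, stored as a JSON-encoded dictionary
       # File names (possibly including directories) mapped to the SHA-256
       # digest of their content; the content itself is stored only once
       # per digest, however many artifacts reference it.
       files: dict[str, str] = field(default_factory=dict)

       def add_file(self, name: str, content: bytes) -> str:
           digest = hashlib.sha256(content).hexdigest()
           self.files[name] = digest
           return digest


   artifact = Artifact(
       category="debian:source-package",
       data={"name": "hello", "version": "2.10-3"},
   )
   artifact.add_file("hello_2.10-3.dsc", b"...")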

Artifacts can have relations with other artifacts, see :ref:`artifact
relationships <artifact-relationships>`.

.. _explanation-assets:

Assets
======

Assets are typed holders of key-value data, with strong permissions
encoding how the data may be used. As with artifacts, the category is used
to distinguish assets sharing the same structure and purpose. See
the :ref:`ontology <assets>` of possible assets.

Assets are used to store credentials (with permissions dictating who can
see/edit/use those credentials). They are also used to represent external
objects with permissions for the various operations that are possible
on those objects. One example of these objects is a signing key (managed
by a signing worker, possibly stored in a hardware security module), and
the associated permissions encode who is able to generate signatures with
that key.

.. _explanation-collections:

Collections
===========

A collection is an abstraction used to manage and store a coherent set of
"collection items". Each collection has a ``category`` field describing
its intended use case, the allowed collection items, the associated
metadata, etc. See the :ref:`reference <collection-reference>`.

A collection is meant to represent things like the following:

* A suite in the Debian archive (e.g. "Debian bookworm"): the
  :collection:`debian:suite` collection is a collection of
  :artifact:`debian:source-package` and :artifact:`debian:binary-package`
  artifacts.
* A Debian archive (a.k.a. repository) that contains multiple suites:
  the :collection:`debian:archive` collection is a collection of
  :collection:`debian:suite` collections
* Build chroots for all Debian suites: the :collection:`debian:environments`
  collection stores :artifact:`debian:system-tarball` artifacts for multiple
  Debian suites
* The results of lintian analyses or autopkgtest runs across all the
  packages in a target suite
* Extracted ``.desktop`` files for each package name in a suite

To cover those various cases, each collection item consists of some
arbitrary metadata and can optionally link to an artifact or to a
collection. Hence we define three kinds of collection items:

* artifact-based items: they link an artifact with some metadata
* collection-based items: they link a collection with some metadata
* bare-data items: they only store some metadata

Each collection item has its own "category" that defines the nature of the
item. For artifact-based items and collection-based items, it duplicates
the category of the linked artifact or collection. For bare-data items, it
indirectly defines the structure to expect in the metadata.

A collection item also has a unique name within the collection, so that
the collection can be seen as a big Python dictionary mapping names
to artifacts, collections and arbitrary data.
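
As a rough mental model only (hypothetical names and structure, not
Debusine's data model or lookup API), a collection can be pictured as a
nested dictionary:

.. code-block:: python

   # Purely illustrative: a "debian:suite" collection with one
   # artifact-based item and one bare-data item.
   collection = {
       "category": "debian:suite",
       "name": "bookworm",
       "items": {
           # artifact-based item: metadata plus a link to an artifact
           "hello_2.10-3": {
               "category": "debian:source-package",
               "artifact_id": 1234,  # hypothetical artifact reference
               "data": {"component": "main", "section": "devel"},
           },
           # bare-data item: metadata only, no linked artifact/collection
           "hello.desktop": {
               "category": "example:desktop-entry",  # hypothetical category
               "data": {"Name": "Hello", "Exec": "hello"},
           },
       },
   }

   item = collection["items"]["hello_2.10-3"]
   print(item["data"]["component"])  # prints "main"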

Collections can be uniquely identified within a workspace by category and
name, and can provide useful starting points for further lookups within
them.

To learn more about collections, you can read more details about their
:ref:`data model <collection-data-models>`.

.. _explanation-tasks:

Tasks
=====

Tasks are time-consuming operations that are typically offloaded to
dedicated workers.

Debusine contains a library of tasks to perform various operations that
are useful when you contribute to Debian or one of its derivatives ("build
a package", "run lintian", "upload a package", etc.).

The behaviour of each task can be controlled/customized with some input
parameters. The combination of a task and actual input parameters
constitutes a :ref:`work request <explanation-work-requests>` that can be
scheduled to run.
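
For illustration, a work request can be pictured as a task name plus a
dictionary of parameters; the field and parameter names below are
simplified examples rather than the exact API fields.

.. code-block:: python

   # Simplified, illustrative representation of a work request: the task
   # to run plus the parameters customizing its behaviour.
   work_request = {
       "task_name": "sbuild",
       "task_data": {
           # hypothetical parameters for the example
           "host_architecture": "amd64",
           "distribution": "bookworm",
       },
   }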

There are :ref:`six types of tasks <reference-task-types>` but the most
interesting ones are the ``Worker``, ``Server`` and ``Signing`` tasks.

Worker tasks
~~~~~~~~~~~~

Worker tasks run on external workers, often within a controlled
:ref:`execution environment <reference-execution-environment>`. They can
only interact with Debusine through the public API. Hence they will
typically only consume and produce artifacts, and create relationships
between them.

Worker tasks can require specific features from the workers on which they
will run. This is used to ensure that the assigned worker will have all
the required resources for the task to succeed.
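
Conceptually, a task can only be assigned to a worker whose advertised
features cover everything the task requires. The check below is a minimal
sketch of that idea (with hypothetical feature names), not the actual
scheduler code.

.. code-block:: python

   def worker_can_run(required: set[str], provided: set[str]) -> bool:
       """Illustrative check: the worker must provide every required feature."""
       return required <= provided


   print(worker_can_run({"amd64", "kvm"}, {"amd64", "kvm", "large-disk"}))  # True
   print(worker_can_run({"arm64"}, {"amd64", "kvm"}))  # False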

Signing tasks
~~~~~~~~~~~~~

Signing tasks are very much like worker tasks, except that they have
access to a local database containing sensitive cryptographic material
(i.e. private keys) that needs to be stored in a secure manner and whose
access should be tightly controlled.

Server tasks
~~~~~~~~~~~~

Server tasks perform operations that require direct database access
and that may take some time to run. They run on Celery workers, and must
not execute any user-controlled code.

.. _explanation-work-requests:

Work requests
=============

Work requests are the way Debusine schedules tasks to workers and monitors
their progress and success. Basically, a work request ties together a task
(that is, some code to execute on a worker) with its parameters (values
used to customize the behaviour of the task).

.. note::

   There are different :ref:`types of tasks <explanation-tasks>`, but they
   all share the same work request structure for the purpose of being
   scheduled. This includes workflows, so much of what is said about
   work requests also applies to workflows, even though we present
   workflows separately from tasks due to their special role in Debusine.

Worker tasks and workflows are the two types of tasks that can be
scheduled individually by Debusine users. All the other types of tasks are
restricted and can only be started indirectly through one of the workflows
that is available in the workspace.

A work request is tied to a workspace. This defines what the task has
access to and where its output will be stored. The :ref:`artifacts
<explanation-artifacts>` generated as output by the task are linked to the
work request and can be easily reused.

To learn more about work requests, you can read:

* :ref:`work-request-scheduling` for more explanations about how work
  requests are scheduled.
* :ref:`work-requests` for more information about the data model and all
  the special cases.

.. _explanation-workflows:

Workflows
=========

Workflows bring advanced orchestration logic to Debusine: they
combine multiple individual tasks in a meaningful way. A workflow
can:

* start multiple work requests
* analyze their results to decide on the next steps
* reuse the output of a work request as input for another work request
* extract data from collections
* feed data into collections
* etc.

Here are some examples of possible workflows:

 * Package build: it would take a source package and a target distribution
   as input parameters, and the workflow would automate the following
   steps:
   { sbuild on all architectures supported in the target distribution }
   → add source and binary packages to target distribution.

   See the :workflow:`sbuild` workflow; a rough sketch of this kind of
   orchestration follows these examples.

 * Package review: it would take a source package and associated binary
   packages and a target distribution, and the workflow would control
   the following steps:
   { generating debdiff between source packages, lintian, autopkgtest,
   autopkgtests of reverse-dependencies } → manual validation by reviewer
   → add source and binary packages to target distribution.

 * Both build and review could be combined in a larger workflow.

   In that case, the reverse-dependencies whose autopkgtests should be run
   cannot be identified until the sbuild task has completed, so the
   workflow would be expanded/reconfigured after that step completed.

 * Update a collection of lintian analyses of the latest packages in a
   given distribution based on the changes of the collection
   representing that distribution.

   Here again the set of lintian analyses to run depends on a first
   step of comparison between the two collections.
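
Taking the package-build example above, the sketch below illustrates the
kind of plan a workflow produces: one child work request per architecture,
plus a final step that depends on all of them. Task and field names are
hypothetical and simplified; this is not Debusine's workflow API.

.. code-block:: python

   def plan_package_build(architectures: list[str]) -> list[dict]:
       """Illustrative only: build a plan of child work requests."""
       steps = [
           {"task_name": "sbuild", "task_data": {"host_architecture": arch}}
           for arch in architectures
       ]
       # Hypothetical final step: add the resulting packages to the target
       # suite once every build has succeeded.
       steps.append({
           "task_name": "add-to-suite",  # hypothetical task name
           "depends_on": list(range(len(steps))),
       })
       return steps


   for step in plan_package_build(["amd64", "arm64"]):
       print(step)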

Terminology
~~~~~~~~~~~

We often use the term "workflow" in different contexts to refer to
different things. Workflows are a special kind of Work Request, so we have
the same distinction between a Task (the code) and a Work Request (a running
instance of the code with specific parameters). Here are the terms that
we use, in the context of workflows, to distinguish between them:

* **Workflow Implementation**: the code implementing the orchestration
  logic. Each workflow implementation accepts its own set of input
  parameters; they are documented in the :ref:`reference documentation
  <workflow-reference>`.

* **Workflow Instance**: it's a Work Request associating a Workflow
  Implementation with a set of input parameters. It has its own lifecycle
  from creation up to completion.

* **Workflow Template**: it's really a shortcut for a *Workflow Instance
  Template*. It is a pre-configured workflow provided by the
  workspace administrator that can be turned into workflow instances by
  users. More on this below.

.. _explanation-workflow-template:

Workflow Template
~~~~~~~~~~~~~~~~~

Workflows are powerful operations, in particular because of their ability
to run server tasks. Because of this, users cannot start arbitrary
workflows; they can only start the subset of workflows that have been made
available by the workspace administrator through *workflow templates*. A workflow
template:

* grants a unique name to a pre-configured workflow so that it can be
  easily identified and started by users
* defines all the input parameters that cannot be overridden when a user
  starts the workflow

The input parameters that are not set in the workflow template are
called run-time parameters and they have to be provided by the user
that starts the workflow.
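
The sketch below illustrates how the two sets of parameters combine; the
function and parameter names are hypothetical, and this is not the actual
merging code.

.. code-block:: python

   def build_workflow_parameters(template_data: dict, runtime_data: dict) -> dict:
       """Illustrative merge: template-provided values cannot be overridden."""
       parameters = dict(runtime_data)
       parameters.update(template_data)  # template values take precedence
       return parameters


   # Hypothetical pre-configured values from the workflow template.
   template_data = {"target_distribution": "debian:bookworm"}
   # Run-time parameters supplied by the user starting the workflow.
   runtime_data = {"source_artifact": 1234}

   print(build_workflow_parameters(template_data, runtime_data))
   # {'source_artifact': 1234, 'target_distribution': 'debian:bookworm'}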

The resulting input parameters are stored in a ``WorkRequest`` model with
a ``task_type`` of ``workflow``, which will be used as the root of a
``WorkRequest`` hierarchy covering the whole duration of the process
controlled by the
workflow. See :ref:`workflow-orchestration` to learn more about how
child work requests are created.

.. _explanation-file-stores:

File stores
===========

Files in artifacts are stored in file stores.  These are content-addressed:
a file with a given SHA-256 digest is only stored once in any given store,
and may be retrieved by that digest.  When a new artifact is created, its
files are uploaded to stores as needed.  Some of the files may already be
present in a store.  In that case, if the file is already part of the
artifact's workspace, it does not need to be uploaded again; otherwise, it
must be re-uploaded to prevent users from gaining unauthorized access to
existing file contents.
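
A simplified sketch of that decision (illustrative only, not Debusine's
implementation):

.. code-block:: python

   import hashlib


   def needs_upload(
       content: bytes, store_digests: set[str], workspace_digests: set[str]
   ) -> bool:
       """Skip the upload only when the file is already in the store *and*
       already reachable from the artifact's workspace."""
       digest = hashlib.sha256(content).hexdigest()
       return not (digest in store_digests and digest in workspace_digests)


   print(needs_upload(b"example", store_digests=set(), workspace_digests=set()))
   # True: the content is not stored yet, so it must be uploaded.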

:file-backend:`Local` storage is useful as the initial destination for
uploads to Debusine, but it has to be backed up manually and might not scale
to sufficiently large volumes of data.  Remote storage such as
:file-backend:`S3` is also available.  It is possible to serve a file from
any store, with policies for which one to prefer for downloads and uploads.

Administrators can set policies for which file stores to use at the
:ref:`scope <explanation-scopes>` level, as well as policies for populating
and draining stores of files.  Most bulk movement is handled by a periodic
job.

To learn more about file stores, see their :ref:`reference
<file-store-reference>`.

.. _explanation-scopes:

Scopes
======

Scopes are the foundational concept used to implement multi-tenancy in
Debusine. They are an administrative grouping of users, groups and
workspaces. They appear as the initial segment in the URL path of most web
views.

Groups and workspaces can only exist in a single scope. Users are global
and might be part of multiple scopes.

Since artifacts have to be stored somewhere, scopes also define the set of
:ref:`file stores <explanation-file-stores>` where files can be stored.

.. _explanation-workspaces:

Workspaces
==========

A workspace is an administrative concept hosting artifacts and
collections. Users can get different levels of access to those artifacts
and collections by being granted different roles on the workspace.

Workspaces have the following important properties:

* ``public``: a boolean indicating whether the artifacts are publicly
  accessible or restricted to the users belonging to the workspace
* ``default_expiration_delay``: the minimal duration that a new
  artifact is kept in the workspace before being expired, as sketched
  below. See :ref:`expiration-of-data`.
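
As a rough illustration of the ``default_expiration_delay`` rule (not the
actual implementation):

.. code-block:: python

   from datetime import datetime, timedelta, timezone


   def earliest_expiration(created_at: datetime, delay: timedelta) -> datetime:
       """Illustrative: an artifact is kept at least this long before it may expire."""
       return created_at + delay


   created = datetime(2024, 1, 1, tzinfo=timezone.utc)
   print(earliest_expiration(created, timedelta(days=30)))
   # 2024-01-31 00:00:00+00:00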

To learn more about workspaces, see their :ref:`reference
<workspace-reference>`.

.. _explanation-workers:

Workers
=======

Workers are services that run :ref:`tasks <explanation-tasks>` on behalf of
a Debusine server.  There are three types of worker.

External workers
~~~~~~~~~~~~~~~~

Most workers are external workers, running an instance of
``debusine-worker``.  This is a daemon that runs untrusted tasks using some
form of containerization or virtualization.  It has no direct access to the
Debusine database; instead, it interacts with the server using the HTTP API
and WebSockets.

External workers process one task at a time, and only process ``Worker``
tasks.

To support spikes in work requests, Debusine is able to use
:ref:`dynamic-worker-pools` to host external workers in clouds.
These are provisioned as required, and terminated when idle.

Celery workers
~~~~~~~~~~~~~~

A Debusine instance normally has an associated Celery worker, which is used
to run tasks that require direct access to the Debusine database.  These
tasks are necessarily trusted, so they must not involve running
user-controlled code.

Celery workers have a concurrency level, normally set to the number of
logical CPUs in the system (:py:func:`os.cpu_count`).

.. todo::

   Document (and possibly fix) what happens when workers are restarted while
   running a task.

Signing workers
~~~~~~~~~~~~~~~

Signing workers work in a similar way to external workers, but they have
access to private key material, either directly or via a hardware security
module.  They only process ``Signing`` tasks.
