<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Franky.Codes - vantage6</title><link href="https://franky.codes/" rel="alternate"></link><link href="https://franky.codes/feeds/tag.vantage6.atom.xml" rel="self"></link><id>https://franky.codes/</id><updated>2025-02-13T09:00:00+01:00</updated><entry><title>From Docker to Kubernetes</title><link href="https://franky.codes/from-docker-to-kubernetes.html" rel="alternate"></link><published>2025-02-13T09:00:00+01:00</published><updated>2025-02-13T09:00:00+01:00</updated><author><name>Frank Martin</name></author><id>tag:franky.codes,2025-02-13:/from-docker-to-kubernetes.html</id><summary type="html">&lt;p class="first last"&gt;Why we are moving from Docker to Kubernetes in version 5&lt;/p&gt;
</summary><content type="html">&lt;p&gt;The vantage6 infrastructure has been tightly coupled to Docker since the beginning of vantage6 in 2017. The node component relies on the Docker API to start the containers that do the computation on the privacy-sensitive data. At the time, Docker was a solid choice as it had development tooling and was free to use even for big commercial projects. Since then a few things have changed:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Docker changed its license policy: in some cases, a license is now required to use Docker.&lt;/li&gt;
&lt;li&gt;Alternative container technologies caught up with Docker in terms of functionality and tooling, for example, Podman and Singularity.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Back in 2017 we also considered using the Kubernetes API instead of the Docker API for starting the computation containers. In this case, we still would have used the Docker engine for running the containers, but Kubernetes would be the interface for managing them (starting, stopping, etc.). This would have been a valid choice; however, it was not implemented at the time as there were more pressing things on the roadmap.&lt;/p&gt;
&lt;p&gt;Since 2017, vantage6 itself has also changed considerably. Where each major part (client, node, server) once consisted of a single component, it now consists of several. In the future, more components will be added in a similar way. Working with a microservice architecture has many advantages (if you are interested, they are listed &lt;a class="reference external" href="https://about.gitlab.com/blog/2022/09/29/what-are-the-benefits-of-a-microservices-architecture/"&gt;here&lt;/a&gt;).&lt;/p&gt;
&lt;div class="section" id="challenges"&gt;
&lt;h2&gt;Challenges&lt;/h2&gt;
&lt;p&gt;While developing vantage6 further, we were starting to hit the limitations of Docker. We then created our own tooling, packages and features for use cases that were already supported in Kubernetes. Let us highlight a few important points.&lt;/p&gt;
&lt;div class="section" id="security"&gt;
&lt;h3&gt;Security&lt;/h3&gt;
&lt;p&gt;In the vantage6 node, we mount the Docker socket so that we can create containers that perform the computation from the vantage6 node application. The Docker daemon process on the host runs as root. This effectively means that the vantage6 node has root access to the host machine. This security issue should be considered by data parties when setting up a vantage6 node.&lt;/p&gt;
&lt;p&gt;The necessity for root access could be dropped by using Docker's &lt;a class="reference external" href="https://docs.docker.com/engine/security/rootless/"&gt;rootless mode&lt;/a&gt;, but that requires more complex configuration steps during node installation, and the vantage6 node will still have unlimited permissions to create containers.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;When moving to Kubernetes we no longer need the Docker socket, so the security model is much more transparent and can be controlled by configuring the Kubernetes API access.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="node-to-node-communication"&gt;
&lt;h3&gt;Node-to-node communication&lt;/h3&gt;
&lt;p&gt;There is a large community out there creating privacy enhancing (or maybe even preserving) algorithms, for example Flower, ScaleMamba, MPyC and Substra, and the list goes on. All these libraries are great, and would it not be awesome if we could use them in vantage6 without modifying the libraries? In this scenario, vantage6 is the mechanism to use these libraries over distributed centers in a secure way.&lt;/p&gt;
&lt;p&gt;However, all these libraries require addresses (IP + port) to communicate with the other parties. This is something that vantage6 lacked. In version 3, we implemented a port and IP protocol using EduVPN. This worked perfectly (thanks Djura, Lourens and Bart), but we later realized that this mechanism, which essentially installs a VPN client in the node, could be considered a backdoor by system administrators.&lt;/p&gt;
&lt;p&gt;To establish node-to-node communication, we therefore need a more transparent and manageable implementation.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Instead of implementing a complicated node-to-node VPN network protocol we can rely on service mesh solutions like Istio to handle secure mutual TLS traffic between algorithm containers in different centers. This would be transparent and much more reliable than building our own solution.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="specialized-computation-jobs"&gt;
&lt;h3&gt;Specialized computation jobs&lt;/h3&gt;
&lt;p&gt;Many different algorithms are used within vantage6, from traditional statistical models to machine learning and image processing. These require different computation resources. Vantage6 in its current form requires the computation to take place on the same machine where the vantage6 node is running. This requires you to install the vantage6 node directly on a machine that has sufficient resources (GPU, CPU, RAM).&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Using Kubernetes and its capability to span multiple workers (HPC) we can start Kubernetes jobs (vantage6 algorithm computations) in different machines with different hardware configurations.&lt;/p&gt;
&lt;/div&gt;
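&lt;p&gt;As a rough illustration of the note above, the sketch below builds a Kubernetes Job manifest as a plain Python dictionary and prints it as JSON, which &lt;tt class="docutils literal"&gt;kubectl&lt;/tt&gt; accepts alongside YAML. The job name, the image name and the node label &lt;tt class="docutils literal"&gt;hardware: gpu&lt;/tt&gt; are made up for this example and are not part of vantage6. The resource name &lt;tt class="docutils literal"&gt;nvidia.com/gpu&lt;/tt&gt; assumes that the NVIDIA device plugin is installed in the cluster.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
import json


def algorithm_job(name, image, node_labels, gpus=0):
    """Build a Job manifest that runs one algorithm container.

    All argument values used below are hypothetical examples.
    """
    container = {"name": name, "image": image}
    if gpus:
        # GPUs are requested through the limits of the container.
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed computation
            "template": {
                "spec": {
                    # Only schedule on workers carrying these labels.
                    "nodeSelector": node_labels,
                    "restartPolicy": "Never",
                    "containers": [container],
                }
            },
        },
    }


job = algorithm_job(
    name="image-segmentation",
    image="registry.example.org/algorithms/segmentation:1.0",
    node_labels={"hardware": "gpu"},
    gpus=1,
)
print(json.dumps(job, indent=2))
&lt;/pre&gt;
&lt;p&gt;A lightweight statistical algorithm would simply omit the GPU limit and use a different (or no) node selector, so the same node could dispatch both kinds of work.&lt;/p&gt;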
&lt;/div&gt;
&lt;div class="section" id="external-data-sources"&gt;
&lt;h3&gt;External data sources&lt;/h3&gt;
&lt;p&gt;In early versions of vantage6, we could only connect to file-based databases which were copied into the node. In the past couple of releases of vantage6, we have worked on ways to connect the vantage6 node instance to external data sources. For this purpose we have released the SSH tunnel, whitelisting and docker-services features.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Even though these features work and provide the access, Kubernetes provides similar, though more advanced, functionality that is supported by a much larger community. Using such battle-tested and transparent technology should be accepted with more confidence by system administrators than an implementation built by a small team such as ours. Also, maintaining code around the Kubernetes communication protocols is less work than managing these features ourselves in the vantage6 code base, giving us more time to work on other features.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="alternative-container-technologies"&gt;
&lt;h3&gt;Alternative container technologies&lt;/h3&gt;
&lt;p&gt;Docker is not always the most appropriate technology to run containers, especially when considering security and privacy. Describing the differences is outside the scope of this blog post, but there are valid reasons to choose any one of:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;WebAssembly&lt;/li&gt;
&lt;li&gt;Apptainer (formerly known as Singularity)&lt;/li&gt;
&lt;li&gt;Podman&lt;/li&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Instead of implementing each container technology API separately we can use the Kubernetes API to start a container regardless of the underlying engine.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="container-health"&gt;
&lt;h3&gt;Container Health&lt;/h3&gt;
&lt;p&gt;We all hate dying nodes: they require a visit to the data station to diagnose what went wrong. Often, we just perform a restart and move on.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Kubernetes would allow us to check the health of a container (by our own defined health checks) and restart by policy if needed.&lt;/p&gt;
&lt;/div&gt;
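&lt;p&gt;A minimal sketch of what such a health check could look like, written as a Python dictionary that mirrors a Pod manifest. The pod name, the image name, the &lt;tt class="docutils literal"&gt;/health&lt;/tt&gt; path and port 8080 are assumptions for this example, not existing vantage6 endpoints.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
import json

# Hypothetical node pod; the image name, path and port are placeholders.
node_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "vantage6-node"},
    "spec": {
        # Restart the container whenever it exits or is killed.
        "restartPolicy": "Always",
        "containers": [
            {
                "name": "node",
                "image": "registry.example.org/vantage6/node:latest",
                "livenessProbe": {
                    "httpGet": {"path": "/health", "port": 8080},
                    "initialDelaySeconds": 30,  # grace period at startup
                    "periodSeconds": 10,  # probe every 10 seconds
                    "failureThreshold": 3,  # restart after 3 failed probes
                },
            }
        ],
    },
}

print(json.dumps(node_pod, indent=2))
&lt;/pre&gt;
&lt;p&gt;With a probe like this, a node that stops responding would be restarted by Kubernetes itself instead of waiting for someone to visit the data station.&lt;/p&gt;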
&lt;/div&gt;
&lt;div class="section" id="resource-management"&gt;
&lt;h3&gt;Resource Management&lt;/h3&gt;
&lt;p&gt;The vantage6 node will spin up unlimited containers until it starts breaking down, resulting in connection loss and a dead node.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Kubernetes monitors resources and has mechanisms to deploy only when there are sufficient resources available. So this deadlock will be automatically prevented.&lt;/p&gt;
&lt;/div&gt;
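&lt;p&gt;The idea of only starting work when resources are available can be sketched in a few lines of Python. This is a toy model and not how the Kubernetes scheduler is implemented; the capacity and the job sizes are invented numbers. In Kubernetes terms, the declared sizes correspond to resource requests, and work that does not fit stays pending.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
def admit(jobs, capacity):
    """Split jobs into those that fit in the free capacity and the rest.

    jobs: list of dicts with "name", "cpu" (cores) and "memory" (GiB).
    capacity: dict with the total "cpu" and "memory" of the worker.
    """
    free = dict(capacity)
    running, pending = [], []
    for job in jobs:
        fits = free["cpu"] >= job["cpu"] and free["memory"] >= job["memory"]
        if fits:
            # Reserve the requested resources for this job.
            free["cpu"] -= job["cpu"]
            free["memory"] -= job["memory"]
            running.append(job["name"])
        else:
            # Not enough room: wait instead of overloading the machine.
            pending.append(job["name"])
    return running, pending


jobs = [
    {"name": "crosstab", "cpu": 1, "memory": 2},
    {"name": "deep-learning", "cpu": 8, "memory": 32},
    {"name": "kaplan-meier", "cpu": 1, "memory": 2},
]
running, pending = admit(jobs, capacity={"cpu": 4, "memory": 16})
print("running:", running)
print("pending:", pending)
&lt;/pre&gt;
&lt;p&gt;Here the two small jobs start and the large one waits, which is the behaviour the current node lacks.&lt;/p&gt;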
&lt;/div&gt;
&lt;div class="section" id="simplified-deployment"&gt;
&lt;h3&gt;Simplified Deployment&lt;/h3&gt;
&lt;p&gt;Deploying a server in vantage6 can be quite cumbersome: there are many components that need to be installed and configured separately.&lt;/p&gt;
&lt;p&gt;On the node, system administrators want to have better insight into how containers interact with one another using a standardized tool.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;Using helm charts to deploy the server in Kubernetes should drastically simplify the server deployment process.&lt;/p&gt;
&lt;p class="last"&gt;Kubernetes provides many tools for administrators for monitoring and managing the application. They therefore no longer require in-depth vantage6 knowledge to analyze container interaction.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="development-environment"&gt;
&lt;h3&gt;Development environment&lt;/h3&gt;
&lt;p&gt;Since we are dealing with many services, testing and developing changes became ever more difficult. We solved most of these issues by creating our own tooling to do so, but this requires effort to maintain and is not easily transferable to other developers.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;By using &lt;a class="reference external" href="https://devspace.sh/"&gt;devspace&lt;/a&gt; we can remove all our previous tooling for development and use a standardized way to share test and development environments.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="kubernetes"&gt;
&lt;h2&gt;Kubernetes&lt;/h2&gt;
&lt;p&gt;Now that we have established that Kubernetes is a must to continue the development of vantage6, we will explain why we will not support both options of using the Docker API and Kubernetes API.&lt;/p&gt;
&lt;div class="section" id="why-not-keep-docker-api"&gt;
&lt;h3&gt;Why not keep Docker (API)&lt;/h3&gt;
&lt;p&gt;Kubernetes is not an alternative to Docker. We still need an engine to run containers, and we see that in many cases this still will be Docker. The functionality provided by the Kubernetes API is different from and much more extensive than that of Docker. Think about networking boxed in services, replication of containers, self-healing when a container is unhealthy, etc.&lt;/p&gt;
&lt;p&gt;In theory, we could implement these Kubernetes features ourselves; in fact, we did! For example, we have built a retry mechanism to restart algorithm containers that have crashed. But we should not forget that we are a small development team with big ambitions. Why would we replicate something that is already out there, for free, created and maintained by 3760 contributors, used by the biggest companies in the world?&lt;/p&gt;
&lt;p&gt;The effort to support a Docker-based node would be large. We would essentially be creating a second vantage6 node without the features described above, as well as a duplicate CLI to manage both Kubernetes and Docker configuration. Having two versions means that:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The entire vantage6 network needs to consider that there may be two different types of nodes&lt;/li&gt;
&lt;li&gt;We maintain twice as much code for the node and the CLI&lt;/li&gt;
&lt;li&gt;Node features need to be implemented twice&lt;/li&gt;
&lt;li&gt;Bug fixes may need to be applied in two places&lt;/li&gt;
&lt;li&gt;Algorithm developers have to ensure their algorithm works on both a Docker node and a Kubernetes node&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These disadvantages mean that maintaining both Docker and Kubernetes is not a viable option.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="kubernetes-installation-data-stations"&gt;
&lt;h3&gt;Kubernetes installation &amp;#64; Data Stations&lt;/h3&gt;
&lt;p&gt;Kubernetes can be installed in many ways. At one end you have the fully managed multi-cluster Kubernetes that hosts many enterprise applications, and at the other end you can already run a single-cluster instance directly from Docker Desktop.&lt;/p&gt;
&lt;p&gt;When vantage6 is running using Kubernetes, it will be easy to support all of these scenarios. We consider a vantage6 node an edge device, so we don't need a big Kubernetes cluster. In its minimal form you install &lt;cite&gt;microk8s&lt;/cite&gt; + &lt;cite&gt;docker&lt;/cite&gt; in a VM, making the installation very similar to its current form.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="tech"></category><category term="kubernetes"></category><category term="docker"></category><category term="tech"></category><category term="vantage6"></category></entry><entry><title>A new approach for the BlueBerry registry using vantage6</title><link href="https://franky.codes/sarcoma-registry-update.html" rel="alternate"></link><published>2024-12-20T09:00:00+01:00</published><updated>2024-12-20T09:00:00+01:00</updated><author><name>Frank Martin</name></author><id>tag:franky.codes,2024-12-20:/sarcoma-registry-update.html</id><summary type="html">&lt;p class="first last"&gt;A new approach for the BlueBerry registry using vantage6&lt;/p&gt;
</summary><content type="html">&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="auto-toc simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#local-data-storage" id="toc-entry-1"&gt;1&amp;nbsp;&amp;nbsp;&amp;nbsp;Local Data Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#researcher-user-interface" id="toc-entry-2"&gt;2&amp;nbsp;&amp;nbsp;&amp;nbsp;Researcher User Interface&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#future-work" id="toc-entry-3"&gt;3&amp;nbsp;&amp;nbsp;&amp;nbsp;Future work&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://euracan.eu/registries/blueberry/"&gt;BlueBerry project&lt;/a&gt; was a two-year
initiative to develop a blueprint for a sustainable, scalable, and impactful data
infrastructure for rare cancers in Europe. In the context of
&lt;a class="reference external" href="https://iknl.nl/en/news/blueberry-is-now-really-taking-off!-building-a-blu"&gt;IKNL&lt;/a&gt;, I
have been involved in extending the &lt;a class="reference external" href="https://vantage6.ai"&gt;vantage6&lt;/a&gt; software to be
able to connect to &lt;a class="reference external" href="https://www.ohdsi.org/data-standardization/"&gt;OMOP data sources&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When the project finished in September 2024, it was decided to continue with the
registry and use it for research. However, several challenges needed to be addressed
first:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The user interface that had been developed for vantage6 lacked the components that
would make working with the OMOP data source easy. It still required an engineer to
operate the system.&lt;/li&gt;
&lt;li&gt;The computation of the output was rather slow as the original data source was visited
for each computation call. This included creating the cohort and querying the selected
features.&lt;/li&gt;
&lt;li&gt;Only &lt;a class="reference external" href="https://github.com/IKNL/v6-crosstab-on-ohdsi-py"&gt;Crosstabulation&lt;/a&gt; and
&lt;a class="reference external" href="https://github.com/IKNL/v6-kaplan-meier-on-ohdsi-py"&gt;Kaplan-Meier curve&lt;/a&gt; had been
extended to work in the registry. There were some experiments with the OHDSI tools,
but these were difficult to operate.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I was asked to work together with &lt;a class="reference external" href="https://www.biomeris.it/en/"&gt;BIOMERIS&lt;/a&gt; on
addressing these issues to enable researchers to use the platform to gain meaningful
insights.&lt;/p&gt;
&lt;p&gt;In this blog post, I will first explain how I addressed the performance issue, as this
influences how the user interface is designed. Then, I will explain how the user
interface is designed to support the workflow of the researcher.&lt;/p&gt;
&lt;div class="section" id="local-data-storage"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;1&amp;nbsp;&amp;nbsp;&amp;nbsp;Local Data Storage&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Typically in vanilla vantage6, the data is fetched from the data source for each
computation call. This made computations slow as the OMOP query was typically
time-consuming. To speed up the computations, I decided to fetch the data once for each
cohort and store it locally in the vantage6 node.&lt;/p&gt;
&lt;div class="uml docutils container"&gt;
&lt;pre class="code literal-block"&gt;
┌──────────┐   ┌────────────┐   ┌────────────┐
│ OMOP     │   │ Query      │   │ Local DB   │
│ Database ├──►│ Algorithm  ├──►│ Parquet    │
└──────────┘   └────────────┘   └────────────┘
&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;The &lt;tt class="docutils literal"&gt;Query Algorithm&lt;/tt&gt; is a vantage6 algorithm that is responsible for fetching the
data from the OMOP database. It creates the ATLAS cohort and reads the patient features.
The data is stored by this algorithm in a &lt;a class="reference external" href="https://parquet.apache.org/"&gt;Parquet&lt;/a&gt; file.
This Parquet file is then used by the other algorithms to perform the analytics.&lt;/p&gt;
&lt;div class="uml docutils container"&gt;
&lt;pre class="code literal-block"&gt;
┌────────────┐   ┌────────────┐   ┌───────────┐
│ Local DB   │   │ vantage6   │   │ Algorithm │
│ Parquet    ├──►│ Algorithm  ├──►│ Output    │
└────────────┘   └────────────┘   └───────────┘
&lt;/pre&gt;
&lt;/div&gt;
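&lt;p&gt;The fetch-once idea can be sketched as below. This is not the actual Query Algorithm: the &lt;tt class="docutils literal"&gt;fetch&lt;/tt&gt; callable stands in for the OMOP query, and JSON is used instead of Parquet only to keep the example free of dependencies. All function and file names are hypothetical.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
import json
import tempfile
from pathlib import Path


def load_cohort(cohort_id, fetch, cache_dir):
    """Return the rows of a cohort, querying the source at most once."""
    cache_file = Path(cache_dir) / f"cohort_{cohort_id}.json"
    if cache_file.exists():
        # Fast path: reuse the local copy stored by an earlier call.
        return json.loads(cache_file.read_text())
    rows = fetch(cohort_id)  # slow path: visit the original data source
    cache_file.write_text(json.dumps(rows))
    return rows


calls = []


def fake_omop_query(cohort_id):
    """Placeholder for the expensive cohort and feature query."""
    calls.append(cohort_id)
    return [{"patient": 1, "age": 61}, {"patient": 2, "age": 47}]


with tempfile.TemporaryDirectory() as cache_dir:
    first = load_cohort(7, fake_omop_query, cache_dir)
    second = load_cohort(7, fake_omop_query, cache_dir)

print(first == second)  # both calls return the same rows
print(len(calls))  # number of times the data source was queried
&lt;/pre&gt;
&lt;p&gt;The second call never touches the data source, which is where the speed-up comes from. The price is that the local copy can become stale.&lt;/p&gt;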
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;In the future, I would like to extend the system so that these Parquet files can also be
modified by the user. For example, the user can create new variables.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;There are some challenges with this approach:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;When a node is offline when a new cohort is created, it will not be able to fetch
the data. In this case, the node will create the cohort data once it comes online. The
user can work with the other nodes in the meantime.&lt;/li&gt;
&lt;li&gt;When the data source is updated, the Parquet files need to be updated as well. This
is currently a manual process as the user needs to trigger the Query Algorithm to
fetch the data again.&lt;/li&gt;
&lt;li&gt;The Parquet files need to have the same variables and the same value types for these
variables. This should be guaranteed by the &lt;tt class="docutils literal"&gt;Query Algorithm&lt;/tt&gt;, especially when the
cohorts are not created at the same time (e.g. when a node was offline when the cohort
was created).&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;An additional benefit of this approach is that algorithms no longer need the
logic to fetch the data from the OMOP database. So the vantage6 community algorithms
can be used without (much) modification.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="researcher-user-interface"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;2&amp;nbsp;&amp;nbsp;&amp;nbsp;Researcher User Interface&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The official vantage6 User Interface (UI) is developed as a general-purpose vantage6 UI.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="vantage6 user interface" src="https://franky.codes/images/sarcoma/screenshots-v6-ui.png" style="width: 800px;" /&gt;
&lt;p class="caption"&gt;The official vantage6 user interface (from &lt;a class="reference external" href="https://vantage6.ai"&gt;https://vantage6.ai&lt;/a&gt;).&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;If a new feature is to be added in this interface, it needs to be compatible with other
projects from the community as well. This has two major disadvantages:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;It feels overcomplicated for the user as it contains features that are not relevant
for the BlueBerry registry and it is not tailored to the workflow of the researcher.&lt;/li&gt;
&lt;li&gt;Adding new features to the UI is time-consuming as it needs to be compatible with
other projects and requires approval from the vantage6 community.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For these two reasons, I decided it would be better to create a separate, dedicated UI
for the BlueBerry registry. This way, I can tailor the workflow exactly as it should be
and I don't have to consider other projects when adding new features.&lt;/p&gt;
&lt;div class="admonition important"&gt;
&lt;p class="first admonition-title"&gt;Important&lt;/p&gt;
&lt;p&gt;As the proposed dedicated UI aims to support the workflow of the researcher, it
is not going to contain all the features that the official vantage6 UI has. The
official vantage6 UI is still available for the BlueBerry registry. It is possible
to switch between the two UIs.&lt;/p&gt;
&lt;p class="last"&gt;For instance, the official vantage6 UI is still used for the management of the
collaborations and studies.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;To accelerate development, I used &lt;a class="reference external" href="https://streamlit.io/"&gt;Streamlit&lt;/a&gt;. This framework
brought the following advantages:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;It minimizes the need to write front-end code, as the front end is generated from
Python code.&lt;/li&gt;
&lt;li&gt;It includes numerous built-in data science components like tables, graphs and
controls.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;However, it introduces an additional backend component, the one that renders the front
end. The app's appearance and components can be customized, but the customization works
very differently from front-end frameworks like React or Angular.&lt;/p&gt;
&lt;p&gt;This newly developed UI aims to better support the researcher's workflow. The first
thing to do after logging in is to select the collaboration and optionally the study to
work with. Once the collaboration/study is selected, the user can view the online
organizations within the collaboration or study. At this point, the user is able to
create sub selections of the organizations they want to work with.&lt;/p&gt;
&lt;div class="scrollx docutils container"&gt;
&lt;table border="1" class="docutils align-center"&gt;
&lt;colgroup&gt;
&lt;col width="50%" /&gt;
&lt;col width="50%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;Collaboration &amp;amp; Study Selection&lt;/th&gt;
&lt;th class="head"&gt;Node status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;&lt;div class="first last figure align-center"&gt;
&lt;img alt="users can select their collaboration and study" src="https://franky.codes/images/sarcoma/collaboration_and_study.jpeg" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Users first need to select the collaboration and optionally the study they
want to work with. Some metadata is shown about the selected collaboration
and study.&lt;/p&gt;
&lt;/div&gt;
&lt;/td&gt;
&lt;td&gt;&lt;div class="first last figure align-center"&gt;
&lt;img alt="users can check the status of the nodes" src="https://franky.codes/images/sarcoma/node_status_redacted.jpeg" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Once the collaboration is selected, the user can view the online
organizations. It is possible to create a sub selection of the organizations
the user wants to work with.&lt;/p&gt;
&lt;/div&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;Once the organizations are selected, the system checks which cohorts are available for
the selected organizations. The UI then determines automatically which cohorts are ready
for analysis. It validates that:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;All the (online) organizations have the cohort available.&lt;/li&gt;
&lt;li&gt;The minimal number of patients threshold is met at each organization.&lt;/li&gt;
&lt;li&gt;All the organizations have the same variables and have the same value types for these
variables.&lt;/li&gt;
&lt;/ol&gt;
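&lt;p&gt;The three checks above can be sketched as a small Python function. The structure of the per-organization report (the keys &lt;tt class="docutils literal"&gt;organization&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;available&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;n_patients&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;schema&lt;/tt&gt;) and the threshold value are assumptions made for this example; the real UI may represent this differently.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
def unhealthy_reasons(reports, min_patients):
    """Return the reasons why a cohort is not ready; empty means healthy.

    reports: one dict per online organization with the keys
    "organization", "available", "n_patients" and "schema", where
    schema maps each variable name to its value type.
    """
    reasons = []
    for report in reports:
        name = report["organization"]
        if not report["available"]:
            reasons.append(f"{name}: cohort not available")
        elif min_patients > report["n_patients"]:
            reasons.append(f"{name}: too few patients")
    # Compare every available schema against the first one.
    schemas = [r["schema"] for r in reports if r["available"]]
    if any(schema != schemas[0] for schema in schemas):
        reasons.append("variables or value types differ")
    return reasons


reports = [
    {
        "organization": "A",
        "available": True,
        "n_patients": 120,
        "schema": {"age": "int", "sex": "category"},
    },
    {
        "organization": "B",
        "available": True,
        "n_patients": 8,
        "schema": {"age": "float", "sex": "category"},
    },
]
for reason in unhealthy_reasons(reports, min_patients=10):
    print(reason)
&lt;/pre&gt;
&lt;p&gt;Returning the reasons rather than a single boolean makes it possible to show the researcher why a cohort is not yet usable.&lt;/p&gt;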
&lt;p&gt;By default, all the &lt;em&gt;healthy&lt;/em&gt; cohorts are selected. The user can also make a sub
selection of the cohorts they want to work with. It is also possible to create a new
cohort based on the &lt;a class="reference external" href="https://atlas-demo.ohdsi.org/"&gt;ATLAS&lt;/a&gt; cohort definitions.&lt;/p&gt;
&lt;div class="scrollx docutils container"&gt;
&lt;table border="1" class="docutils align-center"&gt;
&lt;colgroup&gt;
&lt;col width="50%" /&gt;
&lt;col width="50%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;Cohort selection&lt;/th&gt;
&lt;th class="head"&gt;Cohort creation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;&lt;div class="first last figure align-center"&gt;
&lt;img alt="users can select the cohorts they want to work with" src="https://franky.codes/images/sarcoma/healthy_cohorts.jpeg" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Users can select the cohorts they want to work with. By default, all the
healthy cohorts are selected. In this case none of the cohorts are healthy.&lt;/p&gt;
&lt;/div&gt;
&lt;/td&gt;
&lt;td&gt;&lt;div class="first last figure align-center"&gt;
&lt;img alt="users can create a new cohort" src="https://franky.codes/images/sarcoma/healthy_cohorts_2.jpeg" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Before the user can continue all the selected organizations need to have the
cohort available. The user is able to select the cohorts and from there
automatically select the organizations that passed the validation.&lt;/p&gt;
&lt;/div&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;Once the cohorts have been selected, the user can continue to the analytics part of the
application. The first analysis that is available is the summary statistics. This gives
an overview of all selected cohorts and their variables. It reports some basic statistics
like missing values, mean, standard deviation, etc.&lt;/p&gt;
&lt;p&gt;The second analysis that is available is the crosstabulation. This is a useful tool
to compare the distribution of two categorical variables. The user can select the
variables they want to compare and the crosstabulation is calculated for all selected
cohorts.&lt;/p&gt;
&lt;p&gt;The third analysis that is available is the Kaplan-Meier curve. This can be used
to compare the survival between cohorts. The dataset contains the survival time and
the event indicator, so these are already preselected.&lt;/p&gt;
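&lt;p&gt;For readers unfamiliar with the method, here is a plain, single-site sketch of the Kaplan-Meier estimator computed from survival times and event indicators. It is not the federated algorithm used in the registry; how the counts are combined across organizations is not shown, and the numbers are invented.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
from collections import Counter


def kaplan_meier(times, events):
    """Return (time, survival probability) pairs at each event time.

    times: follow-up time per patient.
    events: 1 if the event was observed at that time, 0 if censored.
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # patients leaving the risk set at each time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        if deaths[t]:
            # Multiply by the fraction of at-risk patients surviving t.
            survival *= 1 - deaths[t] / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
&lt;/pre&gt;
&lt;p&gt;Censored patients (event indicator 0) lower the number at risk without lowering the curve, which is why both columns are needed.&lt;/p&gt;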
&lt;div class="scrollx docutils container"&gt;
&lt;table border="1" class="docutils align-center"&gt;
&lt;colgroup&gt;
&lt;col width="33%" /&gt;
&lt;col width="33%" /&gt;
&lt;col width="33%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;Summary statistics&lt;/th&gt;
&lt;th class="head"&gt;Crosstabulation&lt;/th&gt;
&lt;th class="head"&gt;Kaplan-Meier curve&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;&lt;div class="first last figure align-center"&gt;
&lt;img alt="users can view the summary statistics of all selected cohorts" src="https://franky.codes/images/sarcoma/summary_stats.jpeg" style="width: 266px;" /&gt;
&lt;p class="caption"&gt;Users can view the summary statistics of all selected cohorts. The summary
statistics are calculated for all selected cohorts.&lt;/p&gt;
&lt;/div&gt;
&lt;/td&gt;
&lt;td&gt;&lt;div class="first last figure align-center"&gt;
&lt;img alt="users can compare the distribution of two variables" src="https://franky.codes/images/sarcoma/crosstabs.jpeg" style="width: 266px;" /&gt;
&lt;p class="caption"&gt;Users can compare the distribution of two variables. The crosstabulation is
calculated for all selected cohorts.&lt;/p&gt;
&lt;/div&gt;
&lt;/td&gt;
&lt;td&gt;&lt;div class="first last figure align-center"&gt;
&lt;img alt="users can compare the survival of two cohorts" src="https://franky.codes/images/sarcoma/kaplan_meier.jpeg" style="width: 266px;" /&gt;
&lt;p class="caption"&gt;Users can compare the survival of two cohorts. The Kaplan-Meier curve is
calculated for all selected cohorts.&lt;/p&gt;
&lt;/div&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="future-work"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;3&amp;nbsp;&amp;nbsp;&amp;nbsp;Future work&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This project will remain in development throughout 2025. There are still several
features that need to be added to the system. The following features are planned:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;The current algorithms need to be extended to support additional features like
stratification.&lt;/li&gt;
&lt;li&gt;Currently in development are some more advanced analytics like the Cox proportional
hazards model and propensity score matching.&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;In the future the &lt;a class="reference internal" href="#local-data-storage"&gt;Local Data Storage&lt;/a&gt; will no longer be necessary as this
feature will be built into the vantage6 core (this feature is called sessions and
is available from &lt;a class="reference external" href="https://github.com/vantage6/vantage6/issues/943"&gt;version 5+&lt;/a&gt;).&lt;/p&gt;
&lt;p class="last"&gt;This might be added to the final stages of the project.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="vantage6"></category><category term="python"></category><category term="vantage6"></category><category term="OHDSI"></category><category term="streamlit"></category></entry></feed>