Notes on Concurrency and Parallelism in Python

September 23, 2025 ☼ Python

I recently developed a Python service that relied heavily on concurrency and parallelism to handle multiple tasks simultaneously. This experience highlighted the importance of understanding the differences between these two concepts and how they can be effectively implemented in Python.

Below you'll find some of my research notes and thoughts on the topic.

The architecture

Before jumping into the notes I want to sketch out a simplified version of the architecture of the system that I helped develop.

Please bear in mind that this is a high-level overview and some details have been omitted for brevity.

The main goal of the service is to parse PDFs and extract their content using LLMs (Data Processor). After the content is extracted, it is recalled and some number crunching is done on it (Data Analyser).

Multiple instances of both Data Processor and Data Analyser are scaled up and down based on the number of items in the queue. We used KEDA ScaledObject for that.

Everything is running on SAP BTP, Kyma runtime.

“SAP BTP, Kyma runtime provides a fully managed cloud-native Kubernetes application runtime based on the open-source project Kyma”.

┌─────────────────────────────────────────────────────────────┐
│  K8S (Kyma Runtime)                                         │
│                                                             │
│                                                             │
│                                                             │
│         ┌─────────────────────────────────────────┐         │
│         │                                         │         │
│         │               NATS Broker               │         │
│         │                                         │         │
│         └───────▲────────────────────────▲────────┘         │
│                 │                        │                  │
│                 │                        │                  │
│             Publish /                 Publish /             │
│             Subscribe                 Subscribe             │
│                 │                        │                  │
│                 │                        │                  │
│                 ▼                        ▼                  │
│        ┌─────────────────┐      ┌─────────────────┐         │
│        │┌────────────────┴┐     │┌────────────────┴┐        │
│        ││┌────────────────┴┐    ││┌────────────────┴┐       │
│        │││                 │    │││                 │       │
│        └┤│ Data Processor  │    └┤│ Data Analyser   │       │
│         └┤                 │     └┤                 │       │
│          └─────────────────┘      └─────────────────┘       │
│                   │                        │                │
│                   │                        │                │
│                   │                        │                │
│         ┌─────────▼────────────────────────▼───────┐        │
│         │                                          │        │
│         │               SAP AI Core                │        │
│         │                                          │        │
│         └──────────────────────────────────────────┘        │
│                                                             │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Types of workload

When designing the system we looked into the types of workloads we would have to deal with. We have both I/O-bound and CPU-bound workloads.

I/O-bound workloads are those that spend more time waiting for I/O operations to complete than using the CPU. Examples include reading and writing files, making network requests, and interacting with databases. In our case, the Data Processor service is primarily I/O-bound, as it needs to read and write large PDF files and make API calls to SAP AI Core (where we have LLMs deployed).

CPU-bound workloads, on the other hand, are those that require significant CPU resources to process data. These workloads benefit from parallelism and can be distributed across multiple CPU cores. The Data Analyser service is an example of a CPU-bound workload, as it performs complex data analysis and number crunching on the extracted content.

How do we handle these workloads in Python?

Python provides several libraries and modules to handle concurrency and parallelism, each with its own strengths and weaknesses.

Before we can talk about specific libraries we need to understand how the language handles concurrency and parallelism at a fundamental level.

Concurrency / Parallelism

My mental model for thinking about the two: concurrency is about dealing with many things at once (structuring a program so that tasks can overlap and interleave), while parallelism is about doing many things at once (tasks literally executing at the same time on multiple cores).

Notes on Python

Python has a mechanism, the Global Interpreter Lock (GIL), that prevents multiple threads from executing Python bytecode simultaneously. Even in a multi-threaded architecture with more than one CPU core!

Hence we often turn to multi-processing or other parallelism techniques to fully utilize the available CPU cores. While multi-processing can effectively distribute CPU-bound workloads across multiple cores, it introduces complexity in data communication between processes due to the lack of shared state.
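As a minimal sketch of this idea (the function name and workloads are made up, not from our actual service), CPU-bound work like the Data Analyser's number crunching can be spread across cores with a process pool:

```python
# Distribute a CPU-bound task across cores with a process pool.
# Each worker is a separate OS process with its own interpreter and GIL,
# so the work runs in true parallel; the trade-off is that arguments and
# results must be picklable, because there is no shared state.
from concurrent.futures import ProcessPoolExecutor

def crunch(n: int) -> int:
    # Stand-in for the real number-crunching step.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Defaults to one worker per CPU core (os.cpu_count()).
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(crunch, [100_000, 200_000, 300_000]))
    print(results)
```

Note the `if __name__ == "__main__"` guard: on platforms that spawn worker processes (macOS, Windows), omitting it makes each worker re-execute the module.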

Multi-threading can also be used for I/O-bound workloads: it can improve performance by allowing one thread to execute while another is blocked waiting for I/O operations to complete (threads yield control to the scheduler while waiting on I/O).
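A small sketch of that effect, with `time.sleep` standing in for a blocking network or disk call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(task_id: int) -> int:
    # time.sleep stands in for a blocking I/O call; like real I/O,
    # it releases the GIL, so the other threads keep making progress.
    time.sleep(0.2)
    return task_id

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fake_io, range(5)))
elapsed = time.perf_counter() - start
# The five 0.2 s waits overlap, so the whole batch takes roughly 0.2 s
# instead of the ~1 s a sequential loop would need.
```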

For I/O-bound workloads we should use asynchronous programming (cooperative multitasking), which lets us specify exactly where tasks yield control for better performance. To use async programming effectively, third-party libraries must also support asynchronous operations, otherwise they block the event loop. This is important and often overlooked.
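A minimal asyncio sketch (the coroutine names are invented, and `asyncio.sleep` stands in for an async HTTP or LLM call):

```python
import asyncio

async def fetch_page(doc_id: int) -> str:
    # "await" is the explicit point where this task yields control back
    # to the event loop. A blocking (non-async) call here would instead
    # stall every task running on the loop.
    await asyncio.sleep(0.1)
    return f"content-{doc_id}"

async def main() -> list[str]:
    # Run all the "requests" concurrently on a single thread.
    return await asyncio.gather(*(fetch_page(i) for i in range(3)))

results = asyncio.run(main())
```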

Libraries used for Asynchronous Programming


If you have any suggestions, questions, corrections or if you want to add anything please DM or tweet me: @zanonnicola