Notes on Tuning Apache NiFi 
Russell Bateman
10 April 2023
last update:
 Random notes on tuning NiFi flows... 
-  Modernly, most discussions of flow tuning involve a NiFi cluster of nodes.
			I speculate that adapting the observations to a single-node installation
			is still useful. Setting up, managing and maintaining clustered NiFi is
			not the easiest thing to do. Graduating from single-node NiFi to a
			cluster is a quantum-sized leap.
			
-  Use the NiFi Summary UI to identify the number of threads used per
			processor. This is reached from the General (hamburger) menu,
			Summary. The summary is global (doesn't change scope just
			because you're currently down inside a process group). It provides tabs
			of summaries for (or by)
			
			-  Processors 
-  Input Ports 
-  Output Ports 
-  Remote Process Groups 
-  Connections (relationship arcs) 
-  Process Groups 
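
 The same per-processor figures can be pulled from the REST API. Below is a
 minimal sketch, assuming an unsecured NiFi listening on http://localhost:8080;
 the endpoint and field names are those of the 1.x API, so verify them against
 your version.

# Sketch: list active thread counts per processor via the NiFi REST API.
# Assumes an unsecured instance at http://localhost:8080; a secured instance
# additionally needs a bearer token or client certificate.
import requests

BASE = "http://localhost:8080/nifi-api"

# Recursive status for the root process group (the same data the Summary UI shows).
status = requests.get(
    f"{BASE}/flow/process-groups/root/status",
    params={"recursive": "true"},
).json()

def walk(snapshot):
    """Yield (name, activeThreadCount) for every processor in the snapshot tree."""
    for entry in snapshot.get("processorStatusSnapshots", []):
        s = entry["processorStatusSnapshot"]
        yield s["name"], s["activeThreadCount"]
    for child in snapshot.get("processGroupStatusSnapshots", []):
        yield from walk(child["processGroupStatusSnapshot"])

root = status["processGroupStatus"]["aggregateSnapshot"]
for name, threads in sorted(walk(root), key=lambda pair: -pair[1]):
    print(f"{threads:3d}  {name}")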
 
-  How to find the values evoked in discussions of tuning (in the UI)?
			(Some of this works differently depending on whether NiFi is clustered
			or not):

			-  thread-pool size
					—set in the General (hamburger) menu → Controller Settings →
					General → Maximum Timer Driven Thread Count

			-  active-thread count
					—given as the leftmost figure on the second banner (top-level
					UI)

			-  available cores
					—see below in system diagnostics

			-  core load average
					—see below in system diagnostics

			-  number of concurrent tasks
					—set in
					Configure Processor → Scheduling → Concurrent Tasks

-  
			Cloudera: Tuning your data flow, published 2019-06-26, last
			updated 2022-12-13. This document discusses
			
			-  The size of flowfiles, large or small, can influence how to tune
						flows. There is nothing built in to isolate flowfiles by size; most
						often, tuning should be done in a global way to ensure no side effects
						between data flows.
			
-  The first and primary bottleneck in NiFi is disk I/O. Distribute
						especially the flowfile content, flowfile
						metadata, and, to a lesser degree, the provenance
						repositories to dedicated disk volumes (the relevant
						nifi.properties entries appear in a snippet near the end of these
						notes). Obviously, SSDs will be far more performant than spindles.
			
-  Viewing various values in the flow
						
						-  Number of threads (in a cluster)
						
-  Number of active threads 
-  Number of cores 
-  Configuring thread-pool size 
 
-  Concurrent tasks 
-  Run duration 
-  Recommendations
						
						-  Adjust the thread-pool size based on the number of
									cores.
						
-  Do not increase the thread-pool size unless the number
									of active threads is observed to remain equal to the maximum
									available threads. (In a three-node cluster with eight cores
									per node, do not increase beyond 24 per node unless the active
									thread count, displayed in the UI, is frequently seen to be
									around 72.)
						
-  If the count of active threads is frequently equal to
									the maximum number of available threads, review the core load
									average on the nodes.
						
-  If the core load average is below 80% of the number of
									cores and the count of active threads is at its maximum,
									slightly increase the thread-pool size. Increase the pool size
									to (n+1) times the number of cores, where n is the current
									multiplier (a worked example follows this list). Keep the load
									average around 80% of the number of cores to account for the
									loss of one node: you want some resources to remain available
									to process the additional work landing on the remaining nodes.
						
-  For the number of concurrent tasks: if

									-  there is back-pressure somewhere in the workflow,
									-  the load average is low, and
									-  the count of active threads is not at the maximum,

									consider increasing the number of concurrent tasks on the
									processors that are not processing enough data and are causing
									the back-pressure.

									Increase the number of concurrent tasks iteratively, by only 1
									each time, while monitoring

									-  how things are evolving globally, in particular how the
												thread pool is shared across all the workflows of the
												cluster,
									-  the load average, and
									-  the active thread count.

									If a processor has active threads (the count is displayed on
									its right side) but is not processing data as fast as expected
									while the load average on the server is low, the issue could
									be related to disk I/O. Check the I/O statistics on the disks
									hosting the NiFi repositories (content, flowfile and
									provenance).
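
 To make the pool-sizing arithmetic above concrete, here is the earlier
 three-node, eight-core scenario worked out (a small Python sketch, purely
 illustrative):

# Worked example of the pool-sizing recommendation above, using the
# three-node, eight-core scenario; the numbers are illustrative only.
cores_per_node = 8
nodes = 3
multiplier = 3                                       # starting point: 3x the core count

pool_size = multiplier * cores_per_node              # 24 threads per node
cluster_max_active = pool_size * nodes               # 72 active threads across the cluster UI
load_average_ceiling = 0.8 * cores_per_node          # ~6.4, i.e. 80% of the cores per node

# Raise the multiplier (to 4, giving a pool of 32 per node) only if the active
# thread count sits at cluster_max_active while the per-node load average stays
# below load_average_ceiling.
next_pool_size = (multiplier + 1) * cores_per_node   # 32 threads per node
print(pool_size, cluster_max_active, load_average_ceiling, next_pool_size)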
 
 
-  It is a good practice to start by setting the Timer Driven Thread Pool
			size equal to three times the number of cores available on your NiFi
			nodes (this is Maximum Timer Driven Thread Count under the General
			(hamburger) menu → Controller Settings). For example, if you have nodes
			with eight cores, you would set this value to 24.
			
-  Concurrent Tasks increases how many flowfiles are processed at once by a
			single processor by using system resources that are then not usable by
			other processors. This provides a relative weighting of
			processors: it controls how much of the system's resources should be
			allocated to this processor instead of to other processors.
 
 This field is available for most processors. There are, however, some
			types of processors that can only be scheduled with a single concurrent
			task. It is also worth noting that increasing this value for a processor
			may increase thread contention and, as a result, setting this to a large
			value can harm productivity.
 
 As a best practice, you should not generally set this value to anything
			greater than 12.
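
 For completeness, a processor's Concurrent Tasks value can also be changed
 through the REST API, which is handy when nudging it up by 1 at a time as
 recommended above. A minimal sketch, assuming an unsecured NiFi at
 http://localhost:8080 and a hypothetical processor id; verify the field names
 against your NiFi version.

# Sketch: bump a processor's Concurrent Tasks by 1 via the NiFi REST API.
# The processor normally has to be stopped before its configuration can change.
import requests

BASE = "http://localhost:8080/nifi-api"
PROCESSOR_ID = "0123456789ab-example"    # hypothetical; copy the real id from the UI

# Read the current configuration and the revision (needed for optimistic locking).
entity = requests.get(f"{BASE}/processors/{PROCESSOR_ID}").json()
current = entity["component"]["config"]["concurrentlySchedulableTaskCount"]

update = {
    "revision": entity["revision"],
    "component": {
        "id": PROCESSOR_ID,
        "config": {"concurrentlySchedulableTaskCount": current + 1},
    },
}
requests.put(f"{BASE}/processors/{PROCESSOR_ID}", json=update).raise_for_status()
print(f"Concurrent tasks: {current} -> {current + 1}")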
-  NiFi is optimized to support flowfiles of any size. This is achieved by
			never materializing the file into memory directly. Instead, NiFi uses
			input and output streams to process events (there are a few exceptions
			with some specific processors). This means that NiFi does not require
			significant memory even if it is processing very large files. Most of the
			memory on the system should be left available for the OS cache. By having
			a large enough OS cache, many of the disk reads are skipped completely.
			Consequently, unless NiFi is used for very specific, memory-oriented data
			flows, setting the Java heap to 8 or 16 Gb is usually sufficient.
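
 The heap itself is set in conf/bootstrap.conf. In a stock install the relevant
 entries are typically numbered java.arg.2 and java.arg.3 (the numbering may
 differ if the file has been customized); an 8 Gb example:

# conf/bootstrap.conf (excerpt): initial and maximum JVM heap for NiFi.
java.arg.2=-Xms8g
java.arg.3=-Xmx8g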
			
 
 The performance expected depends directly on the hardware and the flow
			design. For example, when reading compressed data from a cloud object
			store, decompressing the data, filtering it based on specific values,
			compressing the filtered data, and sending it to a cloud object store,
			the following results can be reasonably expected:
 
 
			
				
| Nodes | Data rate per second | Events per second | Data rate per day | Events per day |
|-------|----------------------|-------------------|-------------------|----------------|
|     1 | 192.5 Mb             | 946,000           | 16.6 Tb           | 81.7 billion   |
|    25 | 5.8 Gb               | 26 million        | 501 Tb            | 2.25 trillion  |
|   150 | 32.6 Gb              | 141.3 million     | 2.75 Pb           | 12.2 trillion  |
 The characteristics of the above include:
					(per node)
			-  32 cores 
-  15 Gb RAM 
-  2 Gb heap 
-  1 Tb persistent SSD (400 Mb/s write, 1200 Mb/s read) 
 
 At a minimum:
			(per node)
			-  4 cores 
-  6 disks dedicated to the repositories (including mirroring)*
			
-  4 Gb heap 
-  1 Tb persistent SSD (400 Mb/s write, 1200 Mb/s read) 
 
 See the full exposé by Mark Payne here.
 
 * Why 6 disks dedicated to the NiFi repositories? Because there is one
			disk and one mirror for each of content, flowfile and
			provenance, in that order of priority and return on investment.
			Why leave out the database (4th) repository? It holds only the user data
			(keys, who's logged in, etc.) plus a database to track configuration
			changes made in the UI.
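
 For reference, the repository locations are set in conf/nifi.properties. A
 sketch of pointing each repository at its own volume follows; the /mnt/...
 mount points are made up for illustration.

# conf/nifi.properties (excerpt): put each repository on its own disk volume.
# Substitute your actual mount points for the hypothetical /mnt/... paths.
nifi.flowfile.repository.directory=/mnt/flowfile-repo
nifi.content.repository.directory.default=/mnt/content-repo
nifi.provenance.repository.directory.default=/mnt/provenance-repo
nifi.database.directory=/mnt/database-repo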
 Single-node NiFi differences... 
From the NiFi UI of a single-node installation, there is no menu choice,
Cluster, under the General (hamburger) menu. What is available is reached
under General → Summary, then clicking
system diagnostics at lower right of the window. This yields tabs
of readouts for
-  JVM
Heap (71.0%)                                  Non-heap
Max:   512 MB                                 Max:   -1 bytes
Total: 512 MB                                 Total: 261.2 MB
Used:  362.02 MB                              Used:  245.48 MB
Free:  149.98 MB                              Free:  15.72 MB
Garbage Collection
G1 Old Generation:   0 times (00:00:00.000)
G1 Young Generation: 792 times (00:00:01.546)
Runtime
Uptime: 87:14:57.964
 
-  System
Available Cores:                              Core Load Average: (for the last minute)
16                                            20.02
 
-  Version
System Diagnostics
------------------
NiFi
NiFi Version    1.13.2
Tag             nifi-1.13.2-RC1
Build Date/Time 03/17/2021 20:45:53 MDT
Branch          UNKNOWN
Revision        174938e
Java
Version         11.0.10
Vendor          AdoptOpenJDK
Operating System
Name            Linux
Version         5.4.0-146-generic
Architecture    amd64
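
 The figures above (and a few more) can also be read programmatically from the
 REST API. A minimal sketch, assuming an unsecured single-node NiFi at
 http://localhost:8080; field names follow the 1.x flow-status and
 system-diagnostics endpoints, so verify them against your version.

# Sketch: read active thread count, available cores and core load average
# from the NiFi REST API of an unsecured instance.
import requests

BASE = "http://localhost:8080/nifi-api"

# Top-banner figures (active threads, queued flowfiles, ...).
status = requests.get(f"{BASE}/flow/status").json()["controllerStatus"]
print("Active threads:", status["activeThreadCount"])
print("Queued:", status["queued"])

# System diagnostics (the same dialog dumped above).
snap = requests.get(f"{BASE}/system-diagnostics").json()["systemDiagnostics"]["aggregateSnapshot"]
print("Available cores:", snap["availableProcessors"])
print("Core load average:", snap["processorLoadAverage"])
print("Heap used:", snap["usedHeap"], "of", snap["maxHeap"])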
 
 Home-grown, practical, remedial hypotheses to test 
-  Make processors that process next to nothing by weight (lightweight
			processors) get all the cycles they want, for example:

			-  UpdateAttribute
			-  RouteOnAttribute
 
-  Weigh the heavy processing in the balance instead: bridle the leisure of
			the big guns (heavy processors) by applying back-pressure, etc.