In a previous article, I wrote about how Sidra relates to Azure Data Factory Self-Hosted Integration Runtimes (SHIR) and about considerations for setting up a SHIR node. In short:
Sidra uses Data Factory pipelines to copy data from source servers into the Data Lake. The data extraction activities of these pipelines copy records from the source databases into Parquet files, saving them as blobs in an Azure Storage container. These copy activities may be “pushed” to SHIR nodes for execution, where the IR agent uses Java libraries to generate the Parquet files. The Integration Runtime (IR) agent executes Data Factory activities in worker processes (diawp.exe); these worker processes, in turn, spawn java.exe processes to run the code of the Parquet-handling libraries.
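You can see this chain of processes on a busy SHIR node from a command prompt. The wmic queries below are just a diagnostic aid, not part of the Sidra setup: they list the diawp.exe workers and the java.exe processes with their parent process IDs, so you can match each java.exe to the worker that spawned it.

wmic process where "name='diawp.exe'" get ProcessId,CommandLine
wmic process where "name='java.exe'" get ProcessId,ParentProcessId,CommandLine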
Java’s memory cap on heaps
Java uses a limited amount of memory for its heap, even if the IR node machine has plenty of RAM. From what I have noticed, Java caps the heap at a quarter of the RAM installed on the machine: for instance, on an 8 GB machine, Java would use at most 2 GB by default. To check, issue the following command and look for the MaxHeapSize or SoftMaxHeapSize values:
java.exe -XX:+PrintFlagsFinal -version | findstr /i "HeapSize"
Why does this heap cap matter? Too small a cap may lead to an OutOfMemoryError: it’s not that there isn’t enough RAM, but that there is no free block in the heap large enough for a new object allocation. Too small a cap also leads to frequent Garbage Collection, which means lower performance and high CPU usage even though there may be plenty of RAM. As I wrote in my previous article, this can be fixed with a system environment variable like the one below:
JAVA_TOOL_OPTIONS = -Xms512m -Xmx16g
Above, -Xms sets the initial heap size and -Xmx sets the maximum heap size, in megabytes (m) or gigabytes (g).
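To set the variable machine-wide, something like this from an elevated command prompt will do (setx writes it to the system environment; processes that are already running won’t see it, so restart the Integration Runtime service, or the node itself, for new java.exe processes to pick it up):

setx JAVA_TOOL_OPTIONS "-Xms512m -Xmx16g" /M

Once the variable is in effect, the JVM confirms it by printing a “Picked up JAVA_TOOL_OPTIONS: …” line on startup.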
Set it under half of RAM
Why not set the maximum heap size to the full amount of RAM on the machine? We could; after all, the Windows memory manager can handle it: the committed memory of a process may be much larger than its working set, the bytes that are actually resident in RAM. But raising the heap to more than half the RAM may trigger another performance problem: memory pressure on the OS will cause frequent Garbage Collections in the .NET runtime too. While the Java code is handling Parquet files, the IR agent’s worker processes also need memory.
I would keep the Java maximum heap size somewhere between ¼ and ½ of the RAM available on the IR node.
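As a quick way to pick a concrete -Xmx value, check the installed RAM on the node and take somewhere between a quarter and a half of it; on a 16 GB node, for instance, that would be roughly -Xmx4g to -Xmx8g. One way to read the installed RAM from the command prompt:

systeminfo | findstr /C:"Total Physical Memory"

You can then re-run the -XX:+PrintFlagsFinal command from above to confirm that MaxHeapSize reflects the new -Xmx.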