Could you hire me? Contact me if you like what I’ve done in this article and think I can create value for your company with my skills.

July 18, 2022 / by Zsolt Soczó

The SQL Server, which thought it was a balloon – Part 2 – the root cause

In the previous part, I introduced a strange problem. A SQL Server occasionally used an excessive amount of memory, much more than the machine’s RAM had.

What was unusual was that the sum heap size of SQL Server was 4 GB only, but it reserved and committed 39GB memory from Windows.

The last part ended by observing that the SQL Server process contained a 3rd party DLL. Third parties can cause memory corruption inside SQL Server. This particular component is an antivirus component that can cause problems if buggy. However, you will see NOT this component was responsible for the failure. Another one was. How do I know? We did not remove this antivirus software from the machine, and the error disappeared from another change I’ll describe soon.

I checked SQL Server logs.

I examined the default trace, but it contained nothing interesting.

I read the default Extended Event-based system_health session, but nothing was wrong.

Then I focused on SQL Server logs. One of the SQL Server logs was 10GB. I could not read it via SQL Server Management Studio because it reads the log entirely in memory, and SSMS is a 32-bit process that cannot store more than 1.5GB of memory content.

When I have to read large logs (and I often do), I like to use the Viewer of the Far Manager. It looks outdated, but it has excellent features for my investigations.

SQL Server error log contained millions of “Failed to allocate pages: FAIL_PAGE_ALLOCATION 1” errors. This repeated error was responsible for the large log size. However, this error message was just a consequence of the excess memory usage.

But there was an even more exciting thing in the log: crashes. The SQL Server experienced exceptions and created memory dumps several times. The dumps themselves did not add new information to the investigation. But some of the text files created beside the dumps do.

Crash dump details

The log shows the statement which caused the SQL Server process exception. The orange area contained the SQL statement, but I had to cover it to protect the client’s IP. It was a select statement from a view. The view refers to a linked server pointed to an Oracle database.

I ran the exact statement with the same parameters, but it did not crash the server. Then I wrote an SQL Server cursor that iterated over all possible parameters, and I dynamically ran the selects in 6 SSMS sessions continuously. I wanted to see if I was able to provoke the error. After an hour, nothing happened. I went to sleep desperately. The problem seemed insurmountable. I was afraid that I could not catch this sucker.

The following day the client said the memory ballooning happened again, and they had to restart the server. I rushed to the server to check the logs. And there were several crashes at night! :)

At this point, it became very probable that the crashes themself caused the server to accumulate excessive memory. Why? I don’t know. I don’t have to know what happens inside the SQL Server after a buggy component screws up the memory.

I asked the client to delete all Oracle-based linked servers to avoid accidentally using the buggy Oracle drivers. It seems Oracles likes to leak, according to very recent articles. And when they do, they do in big. I quote: “Huge Memory Consumption With Oracle Client 12.2 Using OraOLEDB”. :)

A month has elapsed, and there have been no crashes and memory issues so far. I cannot claim 100% that the issue is resolved, but it probably does. Time will tell.

How can the client avoid this issue in the future?

  1. They can wait for a not-so-buggy driver from Oracle.
  2. Make the data pump out of the process from SQL Server. For example, they can write a .NET console app directly connecting to the Oracle data source and the SQL Server destination. The app reads the data from Oracle and writes it to SQL Server via BulkInsert. If the driver eats the memory, they can restart the console app. This solution is just mitigation and needs more effort, but at least it won’t destabilize the SQL Server in the future.

Moral of the story: do not let in-process 3rd party components ruin your SQL Server. Especially if a direct competitor writes them. :)

Could you hire me? Contact me if you like what I’ve done in this article and think I can create value for your company with my skills.