Incompetent FireBird Developer Woes

A client of ours (45 workstations) has practice management software that uses FireBird SQL Server as its back-end database. It stores all of their patient, billing, claims, scheduling, forms, and procedure data. Typical usage of the application by the front desk staff is checking in/out, scheduling, and collecting money from patients. They all pretty much reference the same data all day long. Typical usage in the back procedure area is pulling up a patient’s x-rays, history, and entering new procedure data. A couple of years ago, they started to notice their software was slowing down more and more.

Some History

They initially had two servers. One was the domain controller that also housed patient x-rays, practice documents, and their orthodontic software. The other ran their practice management software database (FireBird). We provided the domain controller and the other server came from their software provider.

Two years ago, we upgraded their old domain controller to a setup of two Ubuntu servers that use DRBD to replicate a partition holding QEMU/KVM virtual machine images from the primary to the secondary. This allows us to quickly restore services in the event of a catastrophic hardware failure on the primary server by simply setting the secondary as the primary and turning on the VMs. (I’ll do a how-to on this setup soon) When we initially engineered this project, we didn’t intend to virtualize their database server, but left plenty of overhead.

The Trouble Starts

After about six months of running great on the new domain controller, the office starts to notice slowness in their practice management software. The drives in the older database server were failing and the database was getting larger. We decided to offer to virtualize the old database server (Server 2003) and put it on our servers as we knew we had the overhead to support it. I allocated four cores and 4GB of memory to the server and it was running great for a few more months.

They started seeing more slowness as their database grew to over 30GB. After monitoring the resources on the newly virtualized server, I noticed all of the memory was being consumed, but no processes were reporting using more than 100MB. I called their software vendor and they explained that we were using an unsupported, non-standard platform and that had to be the cause. So I took some time and revioewed my QEMU/KVM settings and made sure I wasn’t missing anything performance-wise. Another call to the software company and they explained that sometimes the firebird processes don’t correctly show the amount of memory they’re using.

Something Isn’t right here…

The slowness continued to get worse and I was rebooting the virtual machine every lunchtime and night as the only option. The software company continued to blame our host machines and wouldn’t help diagnose the issue, even though the client had support with them. In an attempt to speed things up, I decided to create a new Server 2008 R2 virtual machine and migrate the database to it. I allocated six cores and 16GB of memory to the new VM. The software company did the database migration after I setup the server for them.

Initially, the performance was much better. After a few months though, it started slowing and I noticed the server was out of memory, again. I also noticed that there were roughly 160 firebird server processes running during the day. The software company explained that firebird creates a process for every connection to it that houses its own cache for each connection as well. I thought that seemed a bit inefficient, but figured since they’re the developers, they know best.

That’s where I went wrong

I started doing some research into FireBird and how it works and why it acts like it does. I started getting more familiar with the command line tools to monitor it and checked the version. As of two weeks ago, they were running 2.1.3 x86. I immediately started questioning the software company about why they were using a really old version. They said that’s the version they install on ALL of their servers. So I started digging deeper and noticed that in that version of FireBird, there is a known bug with it and 64bit OS’s. Specifially, how the system manages FireBird’s cache. I brought this to the attention of the software company and they said they couldn’t update the version of FireBird or it would break the software. So back to rebooting twice a day while I researched more options.

I started to read techniques for optimizing FireBird and realized that the software company had left all of FireBird’s options at default values. I approached them again and they said they had no idea what I was talking about and that if I changed settings, things would break. The continued to blame our hardware and started proposing a new server to the client. They also claimed that my client was the only one having these issues.

Not on my watch!

I decided it was time to start testing some of these optimization changes after taking snapshots of the VM. I started messing with the page buffers and the write-cache mode. Magically, the software still worked and was slightly faster. It still used all of the memory though. I decided to contact a buddy of mine who develops an application using FireBird as its back-end as well; and he gave me some pointers on page buffers and other settings.

During this time, the client got fed up with the software company and demanded a list of their clients with similar sizes so they could call and ask them about their performance. After some calling around, they figured out that they were not the only client with this issue and an even larger office had “figured out how to fix it”. They gave me the contact information to that office and I had a long, interesting talk with their IT administrator. Turns out, they fixed the problem by purchasing a MASSIVE enterprise class server. $15k later, the software would run fast and stable. One important note is that they had 120GB of memory, and 70GB of it was being used by FireBird. Their entire database was residing in the memory, which is why it “solved their problem”.

avoiding spending $15k on a software problem

Finally, I got fed up with waiting for a solution and decided on a list of changes we were going to make, regardless of what the software developer said. First thing we did was upgrade FireBird to current x64. We then switched to the superserver mode which uses one FireBird process and shares its cache with all connections to it. With this, we had to configure the FireBird process to use more than one CPU, which was an easy option in the firebird.conf. Next, we changed the database page size to 16KB, increased the maximum page buffers to 650,000, and told the database to use asynchronous write-caching . Then we changed the option for FireBird to use file-system cache or FireBird’s cache. We opted for file-system cache only. In some cases, FireBird will do both which essentially doubles its memory usage for no reason. Finally, we set the FireBird database to sweep its old transactions every 100,000 transactions instead of every 20,000. Sweeping eats CPU resources so we wanted to do it less frequently. They do 100k transactions in about an hour.

Today was the first day after making all of these changes and they are faster than they ever have been. Their software went from taking 16 minutes to open and be ready to use to 40 seconds. FireBird is using only 6GB of memory and about 20% CPU. (I decided to start small on the memory and increase it as we tested more) They have run all day without a single failure or slow-down. In the next couple of days, I’ll do a quick write-up of exactly how we made the changes we did and how to check the current settings. Here is the write-up on how we made our changes and how to check the current settings.

TL;DR:

Software company setup their database using defaults and had no idea how to optimize it. Instead of taking them at their word, we did research and made changes on our own which improved speeds exponentially.

Read 3 comments

  1. Pingback: Awesome write up on investigation vs. more hardware | Patrick Pierson

  2. Pingback: Firebird News » Software performance investigation vs throwing more hardware

  3. Pingback: PageBuffers werden nicht richtig angewendet - Delphi-PRAXiS

Leave a Reply