FireBird Performance Tweaking

Last week, I posted a story about my interaction with a software developer who uses FireBird (FB) as a back-end for their software. The short version is that they left FB’s settings at default and were using a really, REALLY old version of FB. This caused the application to have massive (18 minute) load times and hang the entire server, with just one client connected. Below are the changes I made to increase overall performance of the application and server. These changes worked well for the software DOX by KSB Dental, but could potentially be used elsewhere. I will attempt to explain each option the best I can, but please keep in mind, I am not a FB database expert. I also edited the PATH variable to point to the FB install\bin folder (C:\Program Files\FireBird\bin, in my case) to make running the commands easier.

32bit VS 64bit

Initially, they were running FB (x86) 2.1.3 on Server 2003 (x86). We upgraded the server’s OS to Server 2008 R2 (x64) but left FB at 2.1.3 (x86). This made no difference in speed after the first week or so. Initially it did, but my guess is that the process of backing up and restoring the database re-indexed the data.

Another issue with 32bit is that there are limits within the OS and FB. Specifically, the fbserver process cannot use more than 2GB of memory and the maximum page buffers that can be used by a database is roughly 130k.

We then upgraded FB to 2.5.2 (x64). Just that switch alone doubled the speed of the application and we could have all 35+ clients connected at the same time.

The switch is pretty straight forward. Just use gback to backup the database, uninstall the old version, install the new version, and then restore using gback again.

One final thing to note is that until version 2.1.5, FB (x86) had a known bug with running on 64bit systems. Essentially, it would keep its own cache and use file system cache. Which means that everything cached was being cached twice. It would also continue using memory until the sever started swapping. It would do this without a single process on the server using more than 100MB of memory in the task manager.

Database Page Size

The default page size is 4KB. You can check to see what your database is using by running the gstat command.

gstat -h <path to DB>

We found that increasing our page size to 16KB increased performance a lot. (Almost double, again) In order to do this, the database needs to have a backup/restore done on it. When restoring, specify the -p switch and set it to 16384.

Classic Server VS SuperServer

There are two main differences between these options. The first difference is that in SuperServer mode, every client that connects to the FB server shares the same cache. For our application, this was a must as almost all of the clients were referencing the same data continuously. The second difference is that in Classic Server mode, every client that connects gets its own fbserver process. Our application sometimes opened 3-4 connections per client so we had roughly 180 fbserver processes running at a time.

Choosing which mode to run in is done during the FB installation.

One last thing to note is that in SuperServer mode, by default, the server will only utilize one processor. This can be changed in the firebird.conf file in the installation directory. The variable to set is CPUAffinity. We use four cores, so we set ours to 15. The following is a small guide on how to set this:

Each core is assigned a bit map value. For example core one is 1, core two is 2, core three is 4, and core four is 8. This keeps doubling for each core you have. (Hint: Binary) If we wanted to use just the last two cores, we would set it to 12. The first two? 3. Cores two and three? 6.

Page buffers

We decided to start by using 600k page buffers. This setting basically tells FB how much cache to use. The size of the cache can be determined by multiplying the page size by the page buffers. For example, our maximum cache size would be 9.2GB. We have since increased it to 800k. To set this, just use the gfix command and reboot the FB server:

gfix -buffers <# of page buffer> -user <DB username> -password <DB password> <Path to DB>
ex: gfix -buffers 600000 -user dbadmin -password secretpassword D:\dbname.gdb

cache types and limits

We opted to use file system caching instead of FB’s built-in caching. (Which we may end up changing later, we’re still undecided) To do this, we need to edit two variables within the firbird.conf file. The first is FileSystemCacheSize which is an integer defining what percentage of the system memory FB can use. We set this at 90% as FB is the only thing running on our server.

Next is FileSystemCacheThreshold. To set FB to always use file system cache, make this value way higher than your page buffers.

Both of these changes require a full server reboot, not just FB.

Write Caching

If you’ve ever done any RAID controller configuring, you may be familiar with Write Through and Write Back (Synchronous and Asynchronous) cache modes. By default, FB uses synchronous caching which means that the data gets written to the DB immediately. This is the best option for data validation in case of catastrophic hardware failure.

If you’re looking to increase speeds and have redundancy at the hardware level (RAID, iSCSI, clustering, battery backup, etc) you can use asynchronous caching which writes the changes to cache to be written to the DB when enough of these changes have been cached. We found this increased our speed quite a bit. To do this, simply use the gfix command and restart the FB server:

gfix -w async -user <DB username> -password <DB password> <Path to DB>
ex: gfix -w async -user dbadmin -password secretpassword D:\dbname.gdb

Sweeping

This is a tricky one. To understand sweeping and the reasons behind it, you have to understand how FB’s transactions work. I’ll try to be brief.

Every time the application interacts with FB (reads or writes) it creates a transaction. These transactions are categorized in two ways (that matter to us): interesting and commited. Once a transaction is started, it becomes interesting to FB. FB then marks it to be committed to the DB. until the application sends the commit command, the DB keeps a copy of the old data and the updated data, in-case it needs to roll back. What this does it keep a lot of extra data in the DB unless its being committed correctly.

FB has a built-in function called sweeping to collect the garbage “left open” changes that never got committed and clean them up. They call this garbage collection. By default, FB does it every 20k transactions. (The difference between the oldest snapshot and the oldest interesting transaction) Sweeping uses a good amount of CPU and memory and can bog the system down. We opted to remove the automatic sweeping and run manual sweeps during specific down times. (After-hours/Lunch) To do this, you’ll need to set the database to stop sweeping automatically with gfix and then reboot the FB server.

gfix -h <interval> -user <DB username> -password <DB password> <Path to DB>
ex: gfix -h 0 -user dbadmin -password secretpassword D:\dbname.gdb

Then you’ll need to setup a scheduled task to run the command to initiate a sweep.

gfix -s -user <DB username> -password <DB password> <Path to DB>
ex: gfix -s -user dbadmin -password secretpassword D:\dbname.gdb

Conclusion

After we made these tweaks to the FB database, we noticed an exponential difference in speed. We went from 18 minutes to open the application and very slow speeds within the application to 40 seconds to open and almost instant moving around within the application. Keep in mind that not ALL of these tweaks may be best for your environment. Try and decide which options would be best for you. If you have any questions, feel free to comment below or shoot me an email and I’ll do my best to point you in the right direction.

Incompetent FireBird Developer Woes

A client of ours (45 workstations) has practice management software that uses FireBird SQL Server as its back-end database. It stores all of their patient, billing, claims, scheduling, forms, and procedure data. Typical usage of the application by the front desk staff is checking in/out, scheduling, and collecting money from patients. They all pretty much reference the same data all day long. Typical usage in the back procedure area is pulling up a patient’s x-rays, history, and entering new procedure data. A couple of years ago, they started to notice their software was slowing down more and more.

Some History

They initially had two servers. One was the domain controller that also housed patient x-rays, practice documents, and their orthodontic software. The other ran their practice management software database (FireBird). We provided the domain controller and the other server came from their software provider.

Two years ago, we upgraded their old domain controller to a setup of two Ubuntu servers that use DRBD to replicate a partition holding QEMU/KVM virtual machine images from the primary to the secondary. This allows us to quickly restore services in the event of a catastrophic hardware failure on the primary server by simply setting the secondary as the primary and turning on the VMs. (I’ll do a how-to on this setup soon) When we initially engineered this project, we didn’t intend to virtualize their database server, but left plenty of overhead.

The Trouble Starts

After about six months of running great on the new domain controller, the office starts to notice slowness in their practice management software. The drives in the older database server were failing and the database was getting larger. We decided to offer to virtualize the old database server (Server 2003) and put it on our servers as we knew we had the overhead to support it. I allocated four cores and 4GB of memory to the server and it was running great for a few more months.

They started seeing more slowness as their database grew to over 30GB. After monitoring the resources on the newly virtualized server, I noticed all of the memory was being consumed, but no processes were reporting using more than 100MB. I called their software vendor and they explained that we were using an unsupported, non-standard platform and that had to be the cause. So I took some time and revioewed my QEMU/KVM settings and made sure I wasn’t missing anything performance-wise. Another call to the software company and they explained that sometimes the firebird processes don’t correctly show the amount of memory they’re using.

Something Isn’t right here…

The slowness continued to get worse and I was rebooting the virtual machine every lunchtime and night as the only option. The software company continued to blame our host machines and wouldn’t help diagnose the issue, even though the client had support with them. In an attempt to speed things up, I decided to create a new Server 2008 R2 virtual machine and migrate the database to it. I allocated six cores and 16GB of memory to the new VM. The software company did the database migration after I setup the server for them.

Initially, the performance was much better. After a few months though, it started slowing and I noticed the server was out of memory, again. I also noticed that there were roughly 160 firebird server processes running during the day. The software company explained that firebird creates a process for every connection to it that houses its own cache for each connection as well. I thought that seemed a bit inefficient, but figured since they’re the developers, they know best.

That’s where I went wrong

I started doing some research into FireBird and how it works and why it acts like it does. I started getting more familiar with the command line tools to monitor it and checked the version. As of two weeks ago, they were running 2.1.3 x86. I immediately started questioning the software company about why they were using a really old version. They said that’s the version they install on ALL of their servers. So I started digging deeper and noticed that in that version of FireBird, there is a known bug with it and 64bit OS’s. Specifially, how the system manages FireBird’s cache. I brought this to the attention of the software company and they said they couldn’t update the version of FireBird or it would break the software. So back to rebooting twice a day while I researched more options.

I started to read techniques for optimizing FireBird and realized that the software company had left all of FireBird’s options at default values. I approached them again and they said they had no idea what I was talking about and that if I changed settings, things would break. The continued to blame our hardware and started proposing a new server to the client. They also claimed that my client was the only one having these issues.

Not on my watch!

I decided it was time to start testing some of these optimization changes after taking snapshots of the VM. I started messing with the page buffers and the write-cache mode. Magically, the software still worked and was slightly faster. It still used all of the memory though. I decided to contact a buddy of mine who develops an application using FireBird as its back-end as well; and he gave me some pointers on page buffers and other settings.

During this time, the client got fed up with the software company and demanded a list of their clients with similar sizes so they could call and ask them about their performance. After some calling around, they figured out that they were not the only client with this issue and an even larger office had “figured out how to fix it”. They gave me the contact information to that office and I had a long, interesting talk with their IT administrator. Turns out, they fixed the problem by purchasing a MASSIVE enterprise class server. $15k later, the software would run fast and stable. One important note is that they had 120GB of memory, and 70GB of it was being used by FireBird. Their entire database was residing in the memory, which is why it “solved their problem”.

avoiding spending $15k on a software problem

Finally, I got fed up with waiting for a solution and decided on a list of changes we were going to make, regardless of what the software developer said. First thing we did was upgrade FireBird to current x64. We then switched to the superserver mode which uses one FireBird process and shares its cache with all connections to it. With this, we had to configure the FireBird process to use more than one CPU, which was an easy option in the firebird.conf. Next, we changed the database page size to 16KB, increased the maximum page buffers to 650,000, and told the database to use asynchronous write-caching . Then we changed the option for FireBird to use file-system cache or FireBird’s cache. We opted for file-system cache only. In some cases, FireBird will do both which essentially doubles its memory usage for no reason. Finally, we set the FireBird database to sweep its old transactions every 100,000 transactions instead of every 20,000. Sweeping eats CPU resources so we wanted to do it less frequently. They do 100k transactions in about an hour.

Today was the first day after making all of these changes and they are faster than they ever have been. Their software went from taking 16 minutes to open and be ready to use to 40 seconds. FireBird is using only 6GB of memory and about 20% CPU. (I decided to start small on the memory and increase it as we tested more) They have run all day without a single failure or slow-down. In the next couple of days, I’ll do a quick write-up of exactly how we made the changes we did and how to check the current settings. Here is the write-up on how we made our changes and how to check the current settings.

TL;DR:

Software company setup their database using defaults and had no idea how to optimize it. Instead of taking them at their word, we did research and made changes on our own which improved speeds exponentially.