A Bug Story: Multi-Threading
Our application is an add-in hosted within Excel. Our BVTs (build verification tests) were crashing every day with a “process cannot access the file because it is being used by another process” error, where the file was one of our product DLLs. It didn’t happen on the same test case every time, but at least one of our 8 BVT cases would inevitably hit it. Since these were BVTs, the repro steps were pretty much open the application and click “Next.” When you tried to reproduce it manually, it usually worked fine. However, if you sat there and did it repeatedly, it would repro by hand maybe one time out of 20.
So I went to show it to the developer, who was in the middle of coding new features, and he immediately redirected me to his lead. I explained the problem. The lead listened, tried it once on his computer, and declared that this wasn’t a bug and that I should leave his developers alone while they were coding.
Feeling frustrated, I made a little application that ran just this test case, with all unnecessary test code stripped out (logging, reading resources for Intl, etc.). With it I could reproduce the problem on demand, both on the BVT machine and on my personal development machine. I went back to the dev lead and said I had an application that should show him the problem. He ran it on his machine and, sure enough, it didn’t repro. He repeated that I was wasting my time and should be doing more manual testing.
It was clear I was going to have to figure this one out by myself if I wanted the BVTs to pass any time soon. Only a few changes could have affected this area. Since I could reproduce the problem on my development machine, I did a binary search: I synced the source code to different points in time and tested each one, until I had singled out the one check-in that seemed to be causing it.
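The binary search over check-ins can be sketched roughly as follows. This is only an illustration, not the tooling that was actually used: the function name, the `change-N` labels, and the `reproduces` callback are all hypothetical. It assumes the check-ins are in chronological order, the oldest one is known good, the newest one is known bad, and the bug stays in once it is introduced. For an intermittent bug like this one, `reproduces` would have to run the test enough times at each point to be trusted.

```python
from typing import Callable, Sequence


def first_bad_checkin(checkins: Sequence[str],
                      reproduces: Callable[[str], bool]) -> str:
    """Return the earliest check-in at which the bug reproduces.

    Assumes checkins[0] is known good and checkins[-1] is known bad.
    """
    good, bad = 0, len(checkins) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if reproduces(checkins[mid]):
            bad = mid    # bug already present: look earlier
        else:
            good = mid   # bug not present yet: look later
    return checkins[bad]


# Hypothetical history of 16 check-ins; pretend the bug arrived in change-111.
history = [f"change-{n}" for n in range(100, 116)]
culprit = first_bad_checkin(history, lambda c: int(c.split("-")[1]) >= 111)
print(culprit)  # change-111
```

With 16 check-ins this takes 4 syncs instead of up to 16, which matters when every sync means a rebuild and a batch of test runs.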
Once I found the check-in, I used a diff tool (windiff) to see what code had changed. The check-in changed how our application pulled data from a resource. Before it, we spawned only one thread at a time, sequentially. After it, we spawned about 20 threads at once to pull the data asynchronously. The bug had been hard to repro because it turned out to happen only in a multi-processor environment. There, the OS spreads the work across the available processors, so our 20 threads ran in parallel. That created a race condition: the threads fought over the same resource, and intermittently one of them threw an exception.
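A minimal sketch of this kind of race, assuming a resource that refuses a second caller instead of making it wait (much like a file opened without sharing). `ExclusiveResource` and `load_in_parallel` are hypothetical names for illustration and are not the product’s code. The `time.sleep` call stands in for the time spent reading, and it lets the threads overlap.

```python
import threading
import time


class ExclusiveResource:
    """Hypothetical stand-in for a file only one caller may use at a time."""

    def __init__(self) -> None:
        self._in_use = threading.Lock()

    def read(self) -> str:
        # Like an exclusive file open: fail right away instead of waiting.
        if not self._in_use.acquire(blocking=False):
            raise OSError("resource is being used by another thread")
        try:
            time.sleep(0.01)  # simulate the time spent reading
            return "data"
        finally:
            self._in_use.release()


def load_in_parallel(resource: ExclusiveResource, workers: int = 20):
    """Start all workers at once; return (successes, failures)."""
    results = []
    start = threading.Barrier(workers)  # release every thread together

    def worker() -> None:
        start.wait()
        try:
            resource.read()
            results.append(True)
        except OSError:
            results.append(False)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results.count(True), results.count(False)


ok, failed = load_in_parallel(ExclusiveResource())
print(f"{ok} succeeded, {failed} failed")
```

The exact split between successes and failures depends on timing and varies from run to run, which is what makes a bug like this intermittent.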
Now, on a single-processor machine (which the dev’s machine was in this case), this didn’t show up, because two threads never truly run at the same time and so were far less likely to hit that critical resource at the same moment. On a dual-processor machine (which the BVTs were running on), it would repro. So I made a quick PowerPoint presentation explaining what was happening and showed it to the dev lead who had been so dismissive of this bug.
The fix was to be more careful about how we created threads. We now load the data synchronously to prevent problems like this from happening in the future. And the dev lead hasn’t “Not Repro’ed” one of my bugs ever since!
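As a rough sketch of what loading synchronously means here: each request finishes before the next one starts, so nothing ever contends for the resource. The names below (`ExclusiveResource`, `load_sequentially`) are hypothetical and only illustrate the idea under that assumption. They are not the actual fix.

```python
import threading
import time


class ExclusiveResource:
    """Hypothetical resource that refuses a second simultaneous caller."""

    def __init__(self) -> None:
        self._in_use = threading.Lock()

    def read(self, item: int) -> str:
        if not self._in_use.acquire(blocking=False):
            raise OSError("resource is being used by another thread")
        try:
            time.sleep(0.001)  # simulate the time spent reading
            return f"data-{item}"
        finally:
            self._in_use.release()


def load_sequentially(resource: ExclusiveResource, items) -> list:
    """Fetch each item one after another on the calling thread."""
    return [resource.read(item) for item in items]


loaded = load_sequentially(ExclusiveResource(), range(20))
print(len(loaded))  # 20
```

The trade-off is speed: the loads no longer overlap, so the total time is the sum of the individual reads. If parallel loading were needed again, guarding the resource with a blocking lock would be another option.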
About the Author:
Ricky graduated from the University of Washington with a Computer Science degree and joined Microsoft right afterward as an SDET on the Office PerformancePoint Server team. This v1 product provides all of the functionality needed for performance management, including planning, budgeting, scorecards, management reporting, and analytics. Ricky’s passion is test automation, and his team invests heavily in it, which lets them catch regression bugs as soon as they are introduced, with minimal manual intervention. Test automation has played an important role in making the team’s test process more efficient.