
Data Import Program: Part Three

Tags: dev
Published: Apr 23 2017

Further Updates

Parts one and two.

As I said before, I knew it wasn't fast enough, but it might be good enough for now. It has only had one opportunity to run; it took roughly 2 hours (rounding up) and ran to completion. I had already been testing ways to reduce that, though, since I don't know that the time will be consistent and that is still pretty long for the small number of records being imported.

I knew that my loop was taking up a lot of the time: a simple foreach loop that had to iterate roughly 180k times (it varied by file). It's not like each iteration was doing too much work, but I started looking into ways to improve it and stumbled onto the Parallel.ForEach method. Adjusting the current loop to use it was no big deal. Then I started to debug and quickly found out that it wasn't so easy.
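The change itself is tiny; roughly this (a sketch only, with lines and ProcessLine standing in for my actual variable and per-record work):

// using System.Threading.Tasks;

// Before: a plain foreach over ~180k lines per file.
foreach (var line in lines)
{
    ProcessLine(line); // builds and runs an INSERT for this record
}

// After: the same body handed to Parallel.ForEach.
Parallel.ForEach(lines, line =>
{
    ProcessLine(line);
});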

It did run; that wasn't the issue. The issue was that each iteration wanted to run its own SQL command. Very quickly my error logging was letting me know that there were no open connections...over and over and over. Some of the records made it to the database, but not all of them.

That's not good enough

The first thing I changed was the SQL logic. Instead of running one insert per iteration, I would store all of the values in a string collection and write them in one go! I knew I was using threading, so I looked it up and there is a thread-safe collection available. How lucky!? So with my new ConcurrentBag<string> collections I was ready to get stuff done. Each entry held a record's values, and the INSERT statement would be prepended to the joined string later.
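Roughly like this (a sketch; BuildValues is a stand-in for whatever builds the VALUES clause for one record):

// using System.Collections.Concurrent; using System.Threading.Tasks;
var queryvalues = new ConcurrentBag<string>();

Parallel.ForEach(lines, line =>
{
    // ConcurrentBag is safe to Add to from multiple threads.
    queryvalues.Add(BuildValues(line)); // e.g. "('val1','val2','val3')"
});
// Later the INSERT statement gets prepended to the joined values.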

Not quite. Half the time it would fail to run because my collection would fill up. I hadn't considered it, but I was getting out-of-memory errors for one of the files. If I tried to pause and inspect the current string, I would crash my Visual Studio instance. It wasn't that there were too many items, just that there were too many characters (that is my understanding). I did what any sensible lazy person would do: I split the collection in half. The first 25k records would go to one collection and the rest would go to another.
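Because of the threading, the split needs a thread-safe counter; the idea is roughly this (a sketch, not my exact code):

// using System.Collections.Concurrent; using System.Threading; using System.Threading.Tasks;
var firstBag = new ConcurrentBag<string>();
var secondBag = new ConcurrentBag<string>();
int count = 0;

Parallel.ForEach(lines, line =>
{
    string values = BuildValues(line);
    // Interlocked keeps the running count correct across threads.
    if (Interlocked.Increment(ref count) <= 25000)
        firstBag.Add(values);
    else
        secondBag.Add(values);
});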

[Image: "problem solved" meme]

Alright, so it's not really a great solution. What if the collection just fills up again? Well, there is room for more records than I have, and the file has never had more than 50k entries, so it should hold for a while. My original idea was to write the data and empty the collection every 10k entries; I don't know if that would even work with the threading, but it's most likely the better approach.

Alright so now it won't run out of memory. It's go time!

[Image: Simpsons cereal fire meme]

I don't deal with large amounts of data often, and this is the first time since I have been at my current job that I'm doing an insert with this many records. I had encountered this before but forgot about it: SQL Server limits a single INSERT ... VALUES statement to 1,000 rows. At least my logging told me exactly what I needed to know.

This time I created a simple for loop to run after the collections were filled. I would take 1000 records at a time and then move on to the next 1000. The code was actually pretty simple with LINQ.

// Batch the collected rows into groups of 1,000 (SQL Server's limit per INSERT ... VALUES).
int batches = (int)Math.Ceiling(queryvalues.Count / 1000.0);
for (int i = 0; i < batches; i++)
{
    fullquery = insertquery + string.Join(",", queryvalues.Skip(i * 1000).Take(1000));
    // ...execute fullquery here before building the next batch...
}

With this new loop I insert 1,000 records at a time, opening one connection per batch.
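Executing a batch looks roughly like this (a sketch assuming System.Data.SqlClient and a connectionString variable; error handling and logging omitted):

// using System.Data.SqlClient;
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(fullquery, connection))
{
    connection.Open();
    command.ExecuteNonQuery(); // runs the 1,000-row INSERT for this batch
}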

Looking at the logs, the run times are all over the place. One file took 80 minutes and now takes 45. The next was at 95 minutes and is now at 9. The last was at 50 minutes and now takes 80. These numbers compare the threaded, single-insert version running in production against my development machine with the improved threading and bulk inserts. I will update once the new version runs in production to see whether the times improve there too.

UPDATE:

In production one file runs in 100 minutes; the other two fail with SQL errors. I have had to revert to the non-parallel version with the simple threading, as that was the last working version.

I did find out rather quickly that it takes all, or most, of the resources on the server. I tried to reduce the number of threads each Parallel.ForEach would use, but it doesn't appear to have any effect. I left the limit in place anyway, just in case each loop decides to try to create a million threads. Locally I am not seeing a slowdown at all. The process runs at midnight, so I am going to watch and see whether it finishes fast enough that using all the resources isn't an issue. From what I have read, you (the developer) shouldn't worry about how the OS allocates the CPU, but there are ways to limit it if you must.
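For reference, the limiting is done through ParallelOptions, roughly like this (a sketch; the 4 is an arbitrary example, not a recommendation):

// using System.Threading.Tasks;
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 }; // 4 is just an example value

Parallel.ForEach(lines, options, line =>
{
    // same per-record work as before
    queryvalues.Add(BuildValues(line));
});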

I also added email notifications, so if there is an error it emails me instead of making me go to the server and dig through the logs.
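Nothing fancy, just a plain SmtpClient call along these lines (a sketch; the addresses, server, and errorDetails variable are placeholders):

// using System.Net.Mail;
using (var client = new SmtpClient("mail.example.com")) // placeholder server
{
    var message = new MailMessage(
        "importer@example.com",   // placeholder from address
        "me@example.com",         // placeholder to address
        "Import error",
        errorDetails);            // whatever the logger captured
    client.Send(message);
}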

UPDATE 2:

In the end I had to remove the Parallel.ForEach and stick with just one thread per file's process. The server was still having issues, but after the system admin increased the CPU available to that server, the process runs much faster and to completion, with times of 10 minutes, 17 minutes, and 70 minutes. No more timeouts, no more missing records, and a job completed.
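The per-file threading that stayed is essentially one thread per file, something like this (a sketch with hypothetical names; files and ProcessFile stand in for my actual list and per-file import):

// using System.Collections.Generic; using System.Linq; using System.Threading;
var threads = files.Select(f => new Thread(() => ProcessFile(f))).ToList();
threads.ForEach(t => t.Start());
threads.ForEach(t => t.Join()); // wait for every file's import to finish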

Final Changes

Obviously there is further room for improvement. There will be some work to stage the data until the task is done and then, if there are no failures, push it to the production table. That way, if something fails the tables won't be empty; they will still have the old (but still useful) data.