Improving Behavior with Large Data

Jul 31, 2012 at 11:47 AM
Edited Jul 31, 2012 at 11:52 AM

I've been using R.Net for a while. It has helped me with my projects, but I'm now running into a lot of issues as the scale grows.

There is an inherent problem in working with two memory managers that have different GC behavior at the same time, and the way R.Net works actually makes this harder to manage.

I'm thankful that the premature disposal of callback delegates is gone. However, I'm having trouble keeping R references intact.

Let's say we have the following code:

var vector = new CharacterVector(engine, input.Documents.Count);
var names = new CharacterVector(engine, input.Documents.Count);

for (var i = 0; i < input.Documents.Count; i++)
  vector[i] = input.Documents[i].ToString();

for (var i = 0; i < input.Documents.Count; i++)
  names[i] = input.Documents[i].Name;

vector.SetAttribute("names", names);

return vector;

Here, I'm converting a set of documents into a named R vector. However, it fails after 535 iterations of the first loop with an ArgumentOutOfRangeException. (There are over 40,000 documents in total, so only a small fraction of them is processed.)

This is expected: since we took no precautions, the vector is garbage collected and its Length becomes 0.

We go one step further and protect both vectors:

var vector = new CharacterVector(engine, input.Documents.Count);
var names = new CharacterVector(engine, input.Documents.Count);

vector.Protect();
names.Protect();

for (var i = 0; i < input.Documents.Count; i++)
  vector[i] = input.Documents[i].ToString();

for (var i = 0; i < input.Documents.Count; i++)
  names[i] = input.Documents[i].Name;

vector.SetAttribute("names", names);

vector.Unprotect();
names.Unprotect();

However, we still fail, this time at iteration 2188 with an AccessViolationException in the InternalString constructor. This is actually a more serious issue, since we now have memory corruption.

To keep the memory state more predictable, I added some forced garbage collection:

var vector = new CharacterVector(engine, input.Documents.Count);
var names = new CharacterVector(engine, input.Documents.Count);

vector.Protect();
names.Protect();

for (var i = 0; i < input.Documents.Count; i++)
{
  if (i % 100 == 0)
    engine.ForceGarbageCollection();

  vector[i] = input.Documents[i].ToString();
}

for (var i = 0; i < input.Documents.Count; i++)
  names[i] = input.Documents[i].Name;

vector.SetAttribute("names", names);

vector.Unprotect();
names.Unprotect();

return vector;

Now we make it to 10400 iterations, and fail during GC with the same AccessViolationException.

This is very confusing, since protecting a vector is supposed to protect all of its members automatically. Since there is no direct interface to the CharacterVector elements (InternalString), I changed the InternalString constructor to call Protect() automatically (this is a major hack, but just to try).

This time we complete the first loop and fail in the second one.

(Reminder: input.Documents.Count = 48947.)

This is also expected, since R can only Protect() so many elements (according to the extension documentation, 10K is the limit). We corrupt the memory by protecting everything.

Now we have a problem. If we do not protect the strings, they are collected automatically. However, we have far more of them than the maximum (90K+ in total, including names). For reference, these are the InternalString constructors with the Protect() hack:

		/// <summary>
		/// Creates a new instance.
		/// </summary>
		/// <param name="engine">The <see cref="REngine"/> handling this instance.</param>
		/// <param name="pointer">The pointer to a string.</param>
		public InternalString(REngine engine, IntPtr pointer)
			: base(engine, pointer)
		{
		    Protect();
		}

		/// <summary>
		/// Creates a new instance.
		/// </summary>
		/// <param name="engine">The <see cref="REngine"/> handling this instance.</param>
		/// <param name="s">The string</param>
		public InternalString(REngine engine, string s)
			: base(engine, engine.GetFunction<Rf_mkChar>("Rf_mkChar")(s))
		{
		    Protect();
		}

(Also, removing the forced GC at this point results in the same number of iterations.)

Unfortunately, I do not know what the next step is, which is why I decided to open this topic here. I'd be very happy if anyone wants to join the discussion.

Jul 31, 2012 at 12:48 PM
Edited Jul 31, 2012 at 12:49 PM

I "solved" the problem. However, this code is very inelegant, and I'd prefer an actual solution.

Since it is obvious that we cannot reliably transfer more than a few hundred items at once, I decided to do this in chunks of 100. Each chunk is concatenated onto a temporary R variable that accumulates the whole collection. The temporaries are then deleted, and the result is safely returned.

        SymbolicExpression Convert()
        {
            var vector = GetVector(engine, input.Documents.Select(d => d.ToString()).ToList());

            var names = GetVector(engine, input.Documents.Select(d => d.Name).ToList());

            vector.SetAttribute("names", names);

            names.Unprotect();
            vector.Unprotect();

            return vector;
        }

        static CharacterVector GetVector(REngine engine, List<string> input)
        {
            // Reject null, empty, or whitespace-only entries up front.
            if (input.Any(string.IsNullOrWhiteSpace))
                throw new ArgumentException("Input contains a null or blank string.", "input");

            var vectorName = GetTempName();
            var blockName = GetTempName();

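            // Ceiling division: number of blocks of at most 100 items.
            for (var i = 0; i < (input.Count + 99) / 100; i++)
            {
                // The last block may hold fewer than 100 items.
                var size = Math.Min(100, input.Count - i*100);
                var block = new CharacterVector(engine, size);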

                block.Protect();

                for (var j = 0; j < size; j++)
                    block[j] = input[i*100 + j];

                if (i == 0)
                    engine.SetSymbol(vectorName, block);
                else
                {
                    engine.SetSymbol(blockName, block);

                    engine.Evaluate(string.Format("{0} <- c({0},{1})", vectorName, blockName));
                }

                block.Unprotect();
            }

            // Fetch the accumulated vector and protect it before removing the R-side temporaries.
            var vector = engine.GetSymbol(vectorName).AsCharacter();
            vector.Protect();

            engine.Evaluate(string.Format("rm({0})", blockName));
            engine.Evaluate(string.Format("rm({0})", vectorName));

            return vector;
        }

        static string GetTempName()
        {
            return "t." + Guid.NewGuid().ToString("N");
        }
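
The block arithmetic used in GetVector can be checked on its own, without R.Net. The following is only a minimal sketch: `ChunkSketch` and `Split` are hypothetical names that do not appear in the code above, and the sketch only reproduces the offset and size calculation, not the transfer to R.

```csharp
using System;
using System.Collections.Generic;

static class ChunkSketch
{
    // Hypothetical helper: returns the (offset, size) of each block needed
    // to cover `count` items in blocks of at most `blockSize` items.
    public static List<(int Offset, int Size)> Split(int count, int blockSize)
    {
        if (count < 0) throw new ArgumentOutOfRangeException(nameof(count));
        if (blockSize <= 0) throw new ArgumentOutOfRangeException(nameof(blockSize));

        var blocks = new List<(int Offset, int Size)>();

        // Ceiling division, the same idea as (input.Count + 99) / 100 above.
        var blockCount = (count + blockSize - 1) / blockSize;

        for (var i = 0; i < blockCount; i++)
        {
            var offset = i * blockSize;
            // Only the last block can be shorter than blockSize.
            blocks.Add((offset, Math.Min(blockSize, count - offset)));
        }

        return blocks;
    }
}
```

For the 48947 documents mentioned earlier, `ChunkSketch.Split(48947, 100)` gives 490 blocks, the last one starting at offset 48900 with 47 items. For an empty input it gives no blocks at all, which corresponds to the case where the loop in GetVector never runs and the final GetSymbol call would look up a symbol that was never assigned.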