How to load data efficiently?

Sep 22, 2013 at 8:26 AM
Edited Sep 22, 2013 at 8:28 AM
Hi there,

I am trying to use R.NET efficiently to calculate a linear regression on an array of Boolean data (against a numeric dependent variable).

This is the code I use:
    regLoad1.Start();
    LogicalMatrix teckLogicalMatrix =
                            rEngine.CreateLogicalMatrix(teckBoolMatrix);

    NumericVector otpNumericVector =
                            rEngine.CreateNumericVector(otpDoubleArray);
    regLoad1.Stop();

    regLoad2.Start();
    rEngine.SetSymbol("TECK_MATRIX", teckLogicalMatrix);
    rEngine.SetSymbol("OUTPUT_VECTOR", otpNumericVector);
    rEngine.Evaluate("TECK_FRAME <- data.frame(cbind( y = OUTPUT_VECTOR, TECK_MATRIX )) ");
    regLoad2.Stop();

    regExecute.Start();
    rEngine.Evaluate("t <- lm( y ~ ., data = TECK_FRAME )");
    regExecute.Stop();
Looking at the stopwatches on these operations --
     regLoad1:  11,655ms
     regLoad2:  9ms
     regExecute:  135ms
How can it take so long just to convert my arrays to their R.NET equivalents? That is many times longer than setting the R symbols or running the regression with lm().

Is there a better way to do this?

For reference, my arrays are 176 KB (bool array) and 235 KB (double array), with 29,000 rows apiece and 15 columns in the bool array.

Thanks a lot for the help!

Donald
Developer
Sep 22, 2013 at 10:39 AM
Well observed and documented, Donald.

Based on what I observed while developing rClr, I'd make a hopefully informed guess that it is due to the way vectors are initialized in R.NET. See the code below from the class Vector.cs. Setting the value at each index of the vector triggers a marshalling from .NET to the native world, which is a costly operation in CPU terms.

R.NET\Vector.cs
      protected Vector(REngine engine, SymbolicExpressionType type, IEnumerable<T> vector)
         : base(engine, engine.GetFunction<Rf_allocVector>()(type, vector.Count()))
      {
         int index = 0;
         foreach (T element in vector)
         {
            this[index++] = element;
         }
      }
The code is neat, concise, and generic. I suspect that improving performance by crossing the .NET/native boundary once for the whole array will require more complicated code. I know more about the native-to-.NET transition, so it is not yet clear to me what R.NET needs for the converse.

Given what I measured with rClr, I expect a throughput of ~20 million bytes/second to be achievable for numeric data, so on the positive side there is a lot of room for improvement.
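
To make the per-element versus whole-array distinction concrete, here is a minimal sketch using only System.Runtime.InteropServices. It is not R.NET's actual code: the class and method names (MarshalDemo, CopyPerElement, CopyBulk) are hypothetical, and the destination is assumed to be a native buffer large enough to hold the array, such as one obtained from Marshal.AllocHGlobal.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical illustration, not part of R.NET.
public static class MarshalDemo
{
    // One marshalling call per element: the managed/native boundary is
    // crossed source.Length times.
    public static void CopyPerElement(double[] source, IntPtr destination)
    {
        for (int i = 0; i < source.Length; i++)
        {
            long bits = BitConverter.DoubleToInt64Bits(source[i]);
            Marshal.WriteInt64(destination, i * sizeof(double), bits);
        }
    }

    // One marshalling call for the whole array: the boundary is crossed once.
    public static void CopyBulk(double[] source, IntPtr destination)
    {
        Marshal.Copy(source, 0, destination, source.Length);
    }
}
```

Both methods leave the same bytes in the native buffer; the only difference is how many times the boundary is crossed, which is the cost suspected above.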
Sep 22, 2013 at 11:27 AM
Edited Sep 22, 2013 at 11:27 AM
Perhaps I should try the COM interface from CRAN? [ 11.5s on load is too much for my application ]
The problem is that it is such a gigantic pain: the only way I can get it to work is by installing the big ugly RAndFriends.exe, which includes splash screens, Excel and Word integration and other nonsense. Even so, I am not sure it works with v3.0+, which some of my R libraries need!
Ack!
Developer
Sep 22, 2013 at 8:50 PM
I'll look at what's doable. I'd actually like to tackle this problem but struggle for time to do it.

Having answered a couple of questions about it on Stack Overflow, I can say that COM-based packages seem to be having problems with R 3.0. I don't know the whole story, but I get the feeling that a CRAN policy change also booted the related packages out of the CRAN repository. That said, it may still be a short-term workaround for you.
Sep 23, 2013 at 12:42 AM
That would be simply awesome, JM. As you said, with your experience with R to CLR you are likely far better positioned to solve this problem than anyone else.
My programming is not going to be up to the mark, but if there is anything else I can do to help (donate, etc.), let me know. A good way to access R from .NET is such a powerful thing for the statistics community!
Developer
Sep 23, 2013 at 1:23 AM
Easier than I thought actually. Nice change from most tasks I usually undertake...

I tested a change on numeric vectors and it looks about 100 times faster for ~10MB of data; I have yet to see how it scales to very large data, but it will very likely hold.

Well, it is about time I contributed code to R.NET itself, besides discussion posts.

Kosei, if you happen to read this post I'll submit code on a branch named after the issue identifier: https://rdotnet.codeplex.com/workitem/56
Sep 23, 2013 at 6:24 AM
You beauty!
JM:
You're a star!
Developer
Sep 23, 2013 at 8:56 AM
OK, you may have a way out of your perf trouble.

I have attached the profiling output and the improvements made so far to the page https://rdotnet.codeplex.com/workitem/56
  • Only the new, faster CreateVector operations are shown; ~2 orders of magnitude faster is what I observed.
  • The middle section compares R to .NET array conversions before and after. This is insanely faster: ~3 orders of magnitude. I had to check several times that this was not a "fake" measurement. It seems not: basic numeric tests confirm it is working properly.
  • R to .NET conversion tops out at 400ms for a numeric vector of 100 million elements. Anything larger seems too big for R to handle by default, even on x64.
The code is committed to a branch named workitem56. You can pull the changes and have a go if you wish.
You should consider this branch alpha, if only because the API will probably change after discussion with the main author: there is a new ToArrayFast method that may not be the most elegant way to override the default ToArray extension method of the .NET framework.
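
For the R to .NET direction, the idea can be sketched as follows. This is an illustration under assumptions, not the code on the workitem56 branch: NativeReadDemo, ReadPerElement and ReadBulk are hypothetical names, and the source pointer is assumed to reference a contiguous native block of doubles of the given length.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical illustration, not the ToArrayFast implementation.
public static class NativeReadDemo
{
    // Reads one element per marshalling call.
    public static double[] ReadPerElement(IntPtr source, int length)
    {
        var result = new double[length];
        for (int i = 0; i < length; i++)
        {
            long bits = Marshal.ReadInt64(source, i * sizeof(double));
            result[i] = BitConverter.Int64BitsToDouble(bits);
        }
        return result;
    }

    // Reads the whole block in a single marshalling call.
    public static double[] ReadBulk(IntPtr source, int length)
    {
        var result = new double[length];
        Marshal.Copy(source, result, 0, length);
        return result;
    }
}
```

Both methods return the same array contents, so a bulk read of this kind can be checked against the per-element version in a unit test.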

Cheers
Sep 24, 2013 at 3:44 AM
Excellent stuff!!! Really good news. Can't wait for it to find its way into 1.5.6! What a dramatic improvement!

Quick question:
I noticed that the tests exercised the new method on numeric and integer vectors. The issue I was having above was largely with a big logical matrix. Will the same improvement apply?

Cheers,
And thanks again,

Donald
Developer
Sep 24, 2013 at 3:51 AM
I have yet to tackle matrices, but I expect the same kind of improvement. The harder part will be finding the time to do it, really; the day job is keeping me busy with other lines of work.

One thing I forgot to mention: runtime is unlikely to change much for large character vectors or matrices. There is probably a way to do things differently, but it is not immediately straightforward, and possibly not feasible without adding some C++ code, which would be a pity given R.NET's pure C# design.
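
For the logical matrix case, one extra step is needed before a bulk copy, because R stores logical values as 32-bit integers (TRUE = 1, FALSE = 0) and lays matrices out column by column. A minimal sketch of that step follows; the names LogicalMatrixDemo, ToColumnMajorInts and CopyToNative are hypothetical, this is not R.NET code, and the destination is assumed to be a native buffer of at least rows * cols ints.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical illustration, not part of R.NET.
public static class LogicalMatrixDemo
{
    // Flattens a .NET bool[row, column] matrix into a column-major int
    // array, matching the layout R uses for logical matrices.
    public static int[] ToColumnMajorInts(bool[,] matrix)
    {
        int rows = matrix.GetLength(0);
        int cols = matrix.GetLength(1);
        var flat = new int[rows * cols];
        for (int c = 0; c < cols; c++)
        {
            for (int r = 0; r < rows; r++)
            {
                flat[c * rows + r] = matrix[r, c] ? 1 : 0;
            }
        }
        return flat;
    }

    // Converts in managed memory first, then crosses the boundary once.
    public static void CopyToNative(bool[,] matrix, IntPtr destination)
    {
        int[] flat = ToColumnMajorInts(matrix);
        Marshal.Copy(flat, 0, destination, flat.Length);
    }
}
```

The conversion loop runs entirely in managed code, so for a 29,000 x 15 matrix it touches 435,000 elements without any native call until the final copy.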
Sep 24, 2013 at 12:15 PM
I think that's okay - the vast majority of R work will be done on integer and numeric datasets (in my expectation!). A great win for now.
If there is anything I can do to help with the matrix upgrade, or with testing the existing changes (I just downloaded workitem56!), let me know! :)
Thanks again for this great upgrade, made my day.
Developer
Sep 24, 2013 at 2:01 PM
I just added the work on matrices (although some of it is not implemented for character matrices: ToArray() throws an exception). There are very basic unit tests to check there is no obvious mistake, but more numeric tests are needed. If you can have a look and give it a spin on your data, that would be good; it is always good to have a pair of independent eyes. Again, be cautious about using this for 'real' stuff before a more tested release - your call if you do.

Cheers.
Sep 24, 2013 at 7:50 PM
No worries - will give it a go and update!
Thanks! Great stuff!
Oct 5, 2013 at 1:39 AM
Edited Oct 5, 2013 at 1:41 AM
Blast -
I downloaded the workitem56 branch source code and built new DLLs, but when I copied these [ RDotNet.dll, RDotNet.NativeLibrary.dll, RDotNet.XML, RDotNet.NativeLibrary.xml ] over the existing ones in my project folder (from 1.5.5), suddenly it can't find a bunch of functions?

Will the (excellent) speed-up of loading vectors and matrices be in the main version soon? Or should I persevere?
Developer
Oct 5, 2013 at 9:03 AM
Hello,

Very surprising. Just to be sure, I checked out the branch head, built it, and then created the demo application from the home page, referencing the newly built binaries: there is no issue finding CreateNumericVector. Do you still have the RDotNet namespace imported?
using RDotNet; 
If this is what is missing, then I don't know how it worked previously.

Without this using statement, the extension methods on the static class REngineExtension will not be found.
If for some reason you do not wish to import the RDotNet namespace, there may still be a way to import only REngineExtension and its methods; I am not sure how.
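
One option that follows from the C# language rules: an extension method is an ordinary static method, so it can be called through its fully qualified class name without importing the namespace. The sketch below demonstrates this with stand-in types; HypotheticalLib, Engine, EngineExtension and CreateVector are made up for illustration and are not R.NET's API.

```csharp
using System;

namespace HypotheticalLib
{
    // Stand-in for an engine type; not R.NET's REngine.
    public class Engine { }

    // Stand-in for a static class holding extension methods.
    public static class EngineExtension
    {
        public static double[] CreateVector(this Engine engine, double[] values)
        {
            return (double[])values.Clone();
        }
    }
}

public static class Caller
{
    public static double[] Build(double[] values)
    {
        var engine = new HypotheticalLib.Engine();
        // engine.CreateVector(values) would not compile here, because
        // there is no "using HypotheticalLib;" directive in this file.
        // Calling the method as a plain static method works regardless.
        return HypotheticalLib.EngineExtension.CreateVector(engine, values);
    }
}
```

Applied to R.NET, and assuming the method takes the engine as its first parameter as extension methods do, the call would presumably look like RDotNet.REngineExtension.CreateNumericVector(rEngine, otpDoubleArray).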

Cheers