Architectural evolutions of R.NET for multiple concurrent R sessions

Developer
Nov 15, 2014 at 1:03 AM
There is demand for handling multiple R sessions in conjunction with R.NET, and for improved support for running it from ASP.NET.

What this discussion is about:
  • Confirm high level use cases for R.NET requiring multiple sessions of R
  • Gather technical and architectural insight into where to evolve R.NET
In your replies, please:
  • post only short summaries of your high level use cases, with a link to more details if need be.
  • technical and architectural proposals can be more substantial; conciseness remains valued.
Known high level use cases
  • Web app multi-user client via ASP.NET
  • Running on cloud computing e.g. R.NET on Azure
  • Cluster computing with MPI
Constraints
  • The R interpreter itself is not thread-safe, so most multi-threaded use of it is unsafe. This is not something R.NET can fix, beyond limited workarounds.
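
A minimal sketch of the usual workaround, assuming a single dedicated thread owns the interpreter and every R call is funnelled onto it. This is illustrative only, not R.NET's actual implementation:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class RDispatcher : IDisposable
{
    private readonly BlockingCollection<Action> _queue =
        new BlockingCollection<Action>();
    private readonly Thread _rThread;

    public RDispatcher()
    {
        // One thread owns the embedded R instance for its whole lifetime.
        _rThread = new Thread(() =>
        {
            foreach (var action in _queue.GetConsumingEnumerable())
                action();
        }) { IsBackground = true };
        _rThread.Start();
    }

    // Marshal a call onto the R thread and block until it completes.
    public T Invoke<T>(Func<T> call)
    {
        var tcs = new TaskCompletionSource<T>();
        _queue.Add(() =>
        {
            try { tcs.SetResult(call()); }
            catch (Exception e) { tcs.SetException(e); }
        });
        return tcs.Task.Result;
    }

    public void Dispose()
    {
        _queue.CompleteAdding();
    }
}
```
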
Scenarios
  • .NET application with a single-session R embedded. This is what R.NET currently excels at, and it is the fastest way to interop with R in terms of MB/sec.
  • .NET app + R.NET_client --> <some channel> --> R.NET_server + R
    <some channel> can be Inter-Process Communication (IPC), WCF, TCP/IP, etc. MPI is possible and has been done, though in practice it has been used only in "pure" scientific computing as far as I know.
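
As a concrete illustration of the second scenario, a hypothetical channel contract could look like the following; the names and the use of opaque long handles are assumptions made for the sketch, not an existing R.NET API:

```csharp
using System;

public interface IRSessionChannel : IDisposable
{
    // Evaluate R code server-side; returns an opaque handle to the
    // resulting SEXP, which stays in the server process.
    long Evaluate(string rExpression);

    // Pull data across the channel only on demand.
    double[] GetNumericVector(long sexpHandle);

    // Tell the server the client no longer needs the SEXP.
    void Release(long sexpHandle);
}
```
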
Related work
  • RProvider.Server, part of [RProvider](https://github.com/BlueMountainCapital/FSharpRProvider). The protocol used is Inter-Process Communication (IPC). I think this is a sound starting point to evolve into a component of R.NET.
  • RServeCli is the equivalent of the R.NET_client layer above. The server is R+RServe, with no .NET involved. The channel protocol is limited to TCP/IP with binary serialisation.
  • RStudio and its handling of R sessions. This is however probably not something that can be reused, for licensing reasons.
Other general technical considerations for R.NET
  • R.NET's SymbolicExpression inherits from SafeHandle. Disposal of these objects usually happens during finalization, which runs on a separate thread and therefore clashes with R's threading limitations. This is worked around with mutex locking, but it remains problematic (see the sketch after this list). A revision of the in-memory representation of R data intersects with what is needed to expose R data structures across multiple processes.
  • rClr relies increasingly on R.NET for more sophisticated data interop.
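
One illustrative way to address the finalization clash above, assuming a single thread owns R: finalizers merely enqueue native handles, and the R thread drains the queue at safe points. This is a sketch of the pattern, not R.NET's current code.

```csharp
using System;
using System.Collections.Concurrent;

public static class PendingReleases
{
    private static readonly ConcurrentQueue<IntPtr> Queue =
        new ConcurrentQueue<IntPtr>();

    // Safe to call from finalizers on the finalizer thread: it only
    // records the handle, it never touches R.
    public static void Enqueue(IntPtr sexp)
    {
        Queue.Enqueue(sexp);
    }

    // Called only from the thread that owns R, e.g. before each
    // evaluation, passing the real native release routine.
    public static void Drain(Action<IntPtr> releaseOnRThread)
    {
        IntPtr sexp;
        while (Queue.TryDequeue(out sexp))
            releaseOnRThread(sexp);
    }
}
```
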
Looking forward to your ideas.
Cheers,
J-M
Nov 15, 2014 at 8:26 PM
We have a web-based analytics platform running under IIS/ASP.NET MVC/WebAPI, hosted on EC2 or on physical boxes, into which we wanted to integrate R as an exploratory data tool for data analysts and research statisticians. We've developed a notebook-style front-end that sends chunks of R code to the back-end and displays the console output as well as any plots produced during the request. We currently output the plots as SVG via the device P/Invoke layer, but we'll be adding a PNG device to handle large SVGs over 50 MB (one plot produced more than 2 GB of SVG data). Because we are multi-user, we required per-session (or per-user, depending) R instances, and we also wanted the ability to stop and restart an instance, even if we lost the per-session data in doing so.

At the time I knew nothing about the mechanics of embedding R and nothing about the structure of R.Net. With that in mind, I chose WCF as our IPC mechanism and a lightweight .exe as the host for the R process and the WCF services. One problem with using WCF as the IPC mechanism is that the WCF hosting options weren't geared towards services that could be dynamically created and shut down at will. Using a lightweight .exe solved that, but it also required implementing a broker for creating and destroying the processes.
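
For illustration, a bare-bones version of such a host .exe might look like this; the names are hypothetical, and the real service would wrap the embedded R engine rather than throw:

```csharp
using System;
using System.ServiceModel;

[ServiceContract]
public interface IRSessionService
{
    [OperationContract]
    string Evaluate(string rCode);   // returns captured console output
}

// The real implementation would wrap the embedded R instance.
public class RSessionService : IRSessionService
{
    public string Evaluate(string rCode)
    {
        throw new NotImplementedException();
    }
}

public static class Program
{
    public static void Main(string[] args)
    {
        // The broker passes a unique session id when it spawns us.
        string sessionId = args[0];
        var host = new ServiceHost(typeof(RSessionService));
        host.AddServiceEndpoint(
            typeof(IRSessionService),
            new NetNamedPipeBinding(),
            "net.pipe://localhost/RSession/" + sessionId);
        host.Open();
        Console.ReadLine();   // the broker tears the process down
        host.Close();
    }
}
```
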

[Architecture diagram]

Initially, I thought the R API was exposed via a flat P/Invoke layer with the object model built on top of it, so I planned to just shim a WCF wrapper around the R API and not do much with the object model. That turned out not to be the case, so I pulled all the R calls down into a flat layer, with the object model calling into that flattened view. I did break the API up into three loose chunks for granularity: one for dealing with SEXPs and R objects, another for managing the R runtime, and the last for dealing with console and graphics output.
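
Sketched as WCF service contracts, the three chunks might look roughly like this; the interface and operation names are chosen for the sketch, not taken from the actual code:

```csharp
using System.ServiceModel;

[ServiceContract]
public interface IRObjectApi          // SEXPs and R objects
{
    [OperationContract] long Parse(string code);
    [OperationContract] long Eval(long expressionHandle);
    [OperationContract] void Unprotect(long handle);
}

[ServiceContract]
public interface IRRuntimeApi         // managing the R runtime
{
    [OperationContract] void Initialize(string[] args);
    [OperationContract] void Shutdown();
}

[ServiceContract]
public interface IROutputApi          // console and graphics output
{
    [OperationContract] string ReadConsole();
    [OperationContract] byte[] ReadPlot(string deviceId);
}
```
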

I also ran into issues with platform-specific pointer and memory arithmetic, mostly inside the vector objects. I didn't want to make assumptions about the size of things in the out-of-process R instances, so I pushed all of that logic out of the object model as well. There were also some platform-specific management routines, such as setting the memory limit, that got repackaged into a Windows or Unix layer and can't be seen from the client side.
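
As an illustration of pushing that logic out of the object model, the wire format could carry the server-reported element size so the client never does its own platform-dependent arithmetic. These are hypothetical types, not the actual implementation:

```csharp
using System;
using System.Runtime.Serialization;

[DataContract]
public class VectorPayload
{
    [DataMember] public int ElementSize;   // sizeof(element) on the server
    [DataMember] public int Length;
    [DataMember] public byte[] Data;       // Length * ElementSize bytes
}

public static class VectorDecoder
{
    public static double[] ToDoubles(VectorPayload p)
    {
        var result = new double[p.Length];
        // Offsets use the server-reported size, never a local sizeof.
        for (int i = 0; i < p.Length; i++)
            result[i] = BitConverter.ToDouble(p.Data, i * p.ElementSize);
        return result;
    }
}
```
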

I didn't like the object model inheriting from SafeHandle, so all handles are aggregated into the parent object, which is usually a SymbolicExpression. Generally, I want the smallest possible object to end up in the finalization queue.
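
The aggregation idea might look roughly like this; the identifiers and the native entry point are illustrative only:

```csharp
using System;
using System.Runtime.InteropServices;

internal static class NativeMethods
{
    // Hypothetical native entry point in the server-side R wrapper.
    [DllImport("RSessionServer")]
    public static extern void Release(IntPtr sexp);
}

internal sealed class SexpHandle : SafeHandle
{
    public SexpHandle() : base(IntPtr.Zero, ownsHandle: true) { }

    public override bool IsInvalid
    {
        get { return handle == IntPtr.Zero; }
    }

    protected override bool ReleaseHandle()
    {
        // In practice this must be routed to the thread that owns R;
        // see the finalization discussion earlier in the thread.
        NativeMethods.Release(handle);
        return true;
    }
}

public class SymbolicExpression : IDisposable
{
    // Aggregated, not inherited: only the small SexpHandle ever
    // reaches the finalization queue.
    private readonly SexpHandle _handle;

    public SymbolicExpression(SexpHandle handle)
    {
        _handle = handle;
    }

    public void Dispose()
    {
        _handle.Dispose();
    }
}
```
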

I changed the way the graphics system gets initialized, and devices are part and parcel of the server system. In reality, we only need a couple of devices, and if someone wants to add a special one, that can be done and integrated as a PR.

Others have talked about wrapping R.Net in WCF from the outside, and I think that's OK, but if you're using the object model you'll run into similar issues, and you'll have to marshal memory from .Net to WCF to .Net to R instead of .Net to WCF to R. If you're just executing strings and collecting the output it doesn't matter much, but if you're integrating C# and R, I think it does.

Low priorities for us are the C# object model and cross-platform code, so those are fairly messed up at the moment.

I'm a fairly strict TDD advocate, so I had some issues working on this. I'm cleaning up the state of the tests and pulling over the R.Net tests that I think are valuable. I was planning to open up my changes on GitHub after completing those tasks this coming week.
Nov 16, 2014 at 4:37 AM
Excited to see this coming together, and the move to GitHub. I just wanted to add a few comments related to cross-platform compatibility.

IPC on Unix and Windows is very different, and several things (Mutex classes, WCF) are not really supported on Mono. In fact, to get the F# R type provider to work on my Mac I needed to remove the IPC server code and run everything in-process instead. With all the recent open-source activity in .NET, I am not sure whether this picture will change soon.

However, RStudio is AGPLv3, so if their sessions model is good (I know nothing of it), perhaps it would be worth investigating further.
Nov 16, 2014 at 4:46 AM
RStudio has a custom C++ backend that communicates using JSON over HTTP. It's a fine approach for cross-platform support, and that source tree was the first thing I looked at when looking for client/server models for R.Net.

For plot output, they generate images and store them on the backend with some sort of cache and paging manager, which works well in a single-session setting.

For us, the original desire for SVG output was so that we could style it similarly to our other data visualization framework (D3). But given the size of some of the plots, and the default styles of higher-level abstractions like ggplot2, SVG is less than stellar at times, and generating images is the way to go. I don't think I want to manage multi-session image caches, though, but that's a debatable choice.