Cloud render farm SaaS
Cloud-based render-farm SaaS for visual effects and 3D computer graphics
At that time, the cloud services provider (CSP) we worked with had recently acquired a small cloud-rendering startup, so their ad-hoc MVP stack had to be transformed into a platform-level product. With the overall aim of scaling the customer base from a handful of "friends-of-friends" visual effects studios to tens of thousands, the service had to meet the reliability and quality expectations of CSP users and comply with stringent industry regulations.
The compute requirements of 3D rendering (usually raytracing) and simulation are so massive that even the huge on-prem clusters of giants like Disney, ILM, Weta, and Pixar are stretched to their limits. Expanding to the public cloud is considered a viable alternative by some of the more technically capable medium-sized production houses. This was the CSP's target audience.
The problem
The MVP codebase for this service was neither scalable nor security-hardened. It failed constantly, inflicting substantial monetary damage on both customers and the CSP, and its lack of features and compliance blocked large market segments. It was also inefficient and very expensive for the CSP to operate.
Core requirements:
- Dramatic reliability improvement: from double-digit percentage failure rates to under 1%.
- Resolution of critical security and compliance issues: elimination of exposed SSH servers, NFS filers, and unsecured render nodes; data wipeout; GDPR; etc.
- Development velocity through fast, cheap dev-environment provisioning and end-to-end automated testing at all levels.
- Seamless rollout of new versions to thousands of existing customers.
- Progress on the feature-development backlog.
We were tasked with developing the new architecture and rolling out the new codebase without noticeable disruption to current users.
The outcome
The rollout of the new system was smooth and relatively quick: the new stack implemented the legacy system's API, ingested its operational data, and customers were switched over gradually at the load-balancer level.
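A minimal sketch of how such a gradual, per-customer switchover might work. The function name and the hash-bucket approach are illustrative assumptions, not the actual implementation; the key property is that routing is deterministic per customer, so raising the rollout fraction only ever moves customers one way:

```python
import hashlib

def backend_for(customer_id: str, new_stack_fraction: float) -> str:
    """Deterministically route a customer to the legacy or new stack.

    Hashing the customer ID (rather than sampling per request) pins each
    customer to one stack, so a gradual rollout never flip-flops a
    customer between the two implementations.
    """
    digest = hashlib.sha256(customer_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "new" if bucket < new_stack_fraction else "legacy"
```

Dialing `new_stack_fraction` from 0.0 to 1.0 over several weeks yields the gradual, reversible migration described above.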
The removal of the market blockers allowed the service to grow to tens of thousands of customers, tripling its run-rate, and the unlocked developer velocity allowed the high-priority feature backlog to be drained within a quarter.
The solution
The main change from the legacy stack was the transition from VM-based, per-customer cloud infrastructure to a Hosted Kubernetes, Docker-based, shared-everything setup. This change, which was effectively a rewrite, dramatically improved security, reduced infrastructure costs, added flexibility in rendering-application version selection, and improved latency and utilization.
Docker images with the rendering software, unlike VM images, could be layered (yielding faster build times), tested in a much more controlled environment, and rolled out smoothly. Provisioning development environments also became much less painful.
Automatic asset/dependency discovery is, in the general case, an unsolved problem in the industry. Many jobs were failing due to missing files, often after incurring significant setup costs. Introducing a custom file system layer that paused the batch process on hitting a missing file and synchronously fetched it from the source machine reduced the failure rate dramatically.
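The pause-and-fetch idea can be sketched at the file-access level. All names here are illustrative, and the copy-based fetch stands in for whatever transport the real system used; the actual layer sat below the application, in the file system, so renderers needed no changes:

```python
import os
import shutil

def open_with_lazy_fetch(path, source_root, local_root, fetch=shutil.copyfile):
    """Open a render asset; on a miss, fetch it synchronously instead of
    failing the job.

    The renderer blocks for the duration of the fetch (the "pause"), then
    resumes with the file in place, so a missing texture or cache no
    longer wastes the setup work already done.
    """
    local_path = os.path.join(local_root, path)
    try:
        return open(local_path, "rb")
    except FileNotFoundError:
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        fetch(os.path.join(source_root, path), local_path)  # synchronous
        return open(local_path, "rb")
```

In the real system the equivalent logic lives behind ordinary `open(2)` calls, which is what makes the technique work for unmodified third-party renderers.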
A custom batch scheduler was developed for Kubernetes, as its stock scheduler failed when handling the tens of thousands of unschedulable tasks typical of batch jobs like rendering. This scheduler allowed the system to scale to an effectively unbounded task backlog and to huge clusters with many thousands of machines.
Disclaimer: due to the proprietary nature of work done for customers and employers, these case studies are merely inspired by that work, are presented at a very high level, and some sensitive details have been changed or omitted.
Interested in what you see?
Start your journey with us