Computer Science Department

Postdoc - Distributed storage for stateful serverless computing

Postdoc offer

 

Background The recently funded H2020 CloudButton project [1] aims to democratize big data by greatly simplifying its programming model with the help of serverless technologies. The core idea is to tap into stateless functions to enable radically simpler, more user-friendly data processing systems. Average cloud users do not want to spend hours understanding complex analytics stacks (e.g., Spark [2], Yarn [3], or Ignite [4]), or to struggle with the choice of instance types, cluster sizes, etc. What they want is a simple interface to execute their optimized, single-machine code in parallel. CloudButton is the technological response to this emerging need. To demonstrate impact, the project targets two strategic settings with large data volumes and diverse analytics requirements: bioinformatics (genomics, metabolomics) and geospatial data (LiDAR, satellite).

 

Objectives The main objective of this position is to specify and implement the storage layer of the CloudButton stack. Serverless computing infrastructures deliver massively parallel, short-lived functions where computation can quickly scale up and down. To cope with this transient nature of computation, the storage layer needs to be auto-scalable and must support ephemeral data that lasts only for the duration of the serverless function calls [6]. One possible approach is to co-locate data with computation and thus operate the storage layer for only a short amount of time. Another key challenge is that storage should ease the transition from single-machine code to the serverless infrastructure. This requires offering not just simple binary storage, but a full library of complex objects, as commonly found in modern programming languages. To improve code modularity, objects need to be composable. For performance, the storage layer may also split them transparently to the serverless functions. A third challenge is that data should be shared among the serverless functions to support stateful computation. Storage should thus include appropriate concurrency control mechanisms to manage concurrent accesses and guarantee data consistency.
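To make the idea of shared objects with concurrency control more concrete, here is a minimal single-process sketch in Python. It is an illustration only and not the CloudButton, Crucial, or Infinispan API: the names ObjectStore, SharedCounter, count_words, and run are hypothetical, threads stand in for serverless function calls, and an in-memory dictionary stands in for the distributed storage layer.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedCounter:
    """Hypothetical shared object; stands in for an object held by the storage layer."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()  # concurrency control for this object

    def add_and_get(self, delta):
        # The lock makes the read-modify-write atomic across concurrent callers.
        with self._lock:
            self._value += delta
            return self._value

    def get(self):
        with self._lock:
            return self._value


class ObjectStore:
    """Hypothetical in-memory store mapping keys to shared objects."""

    def __init__(self):
        self._objects = {}
        self._lock = threading.Lock()

    def get_or_create(self, key, factory):
        # Create the object on first access so that all callers share one instance.
        with self._lock:
            if key not in self._objects:
                self._objects[key] = factory()
            return self._objects[key]


def count_words(store, chunk):
    """Stand-in for one serverless function call working on one data chunk."""
    counter = store.get_or_create("word_count", SharedCounter)
    counter.add_and_get(len(chunk.split()))


def run(chunks, workers=4):
    """Run one simulated function call per chunk and return the shared total."""
    store = ObjectStore()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so that every call completes before reading the total.
        list(pool.map(lambda chunk: count_words(store, chunk), chunks))
    return store.get_or_create("word_count", SharedCounter).get()
```

In this sketch the function body reads like single-machine code, while the shared counter carries the state across calls. A real storage layer would additionally have to handle distribution, scaling, and the removal of ephemeral objects once the function calls finish.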

 

Work Plan As a starting point, the storage layer will be built atop the Infinispan data grid [7] developed at Red Hat, a CloudButton partner, using the contributions made to the Crucial framework [8]. To demonstrate applicability, the postdoc will port an existing machine learning library to the CloudButton stack and evaluate it in practice using standard data analytics workloads.

 

Start date As soon as possible, for a duration of 12 to 24 months. Applications are accepted now, and the position will remain open until filled.

 

To Apply Required skills and background:

Please provide:

 

Contact Pierre Sutra

 

References