Ryax orchestrates Precision Agriculture data analytics workflows for the CYBELE European project
Data workflow automation across hybrid IoT, Big Data, HPC and Cloud infrastructures will be one of the cornerstones of CYBELE’s agriculture & livestock farming program.
The importance of Agriculture in today’s societies cannot be overstated, since it constitutes the backbone of the economic system of any country: not only does agriculture provide the obvious food and raw-material supply, it also generates employment opportunities for very large portions of the population (an estimated 70% of people in the world directly rely on agriculture as a means of living). Agriculture will play a key role in Europe’s near-future challenges, such as production efficiency in the context of ever-scarcer natural resources, the empowerment of rural populations, and the development of sustainable agri-food industries. The use of advanced technologies such as sensors, IoT (Internet of Things), satellites and drones in the service of Precision Agriculture has substantially increased the volume of data that needs to be processed. Moreover, accuracy and time-criticality are becoming important requirements in these contexts. State-of-the-art technologies such as Big Data and High Performance Computing can definitely help address these issues, but they have never been combined for use cases such as agriculture, and they are not simple tools for non-IT experts to use.
The CYBELE project aims to tackle these major challenges by implementing new models of convergence and abstraction, bringing together state-of-the-art technologies in the fields of High Performance Computing (HPC), Big Data, Cloud Computing and IoT (Internet of Things) to revolutionize farming, reduce scarcity and substantially increase food supply. CYBELE will enable farmers to apply tailored care and manage resources more effectively, with the goal of boosting production and improving efficiency while minimizing waste and environmental impact.
CYBELE will roll out a complete set of demonstrators covering 9 topics in total: from protein-content prediction in organic soya yields, to climate-smart predictive models, autonomous robotic systems and crop yield forecasting, down to sustainable livestock production, aquaculture and open-sea fishing.
CYBELE project at a glance: Empowering Agriculture and Livestock Farming by integrating multiple data sources with Big Data analytics on supercomputers
CYBELE provides a platform targeting Precision Agriculture and Precision Livestock Farming use cases, enabling:
- the fusion of a wide variety of data sources and types (satellite images, hyperspectral drone videos, weather and climate stations, market pricing, sensor measurements, stock GPS locations, RFID, business-specific historical data, etc.);
- the development of business-specific (agriculture and farming) software programs (crop growth/disease prediction, animal weight/health prognosis, farm extreme-conditions forecasting, fish feeding optimization, etc.) leveraging Big Data-enabled algorithms, along with the integration of artificial intelligence and data analytics frameworks (TensorFlow, Caffe, Horovod, Spark, etc.) to optimally exploit the power of data;
- the execution of Big Data analytics on High Performance Computers to efficiently meet the extreme data-processing, high-accuracy and time-criticality needs while guaranteeing end-to-end security and data privacy. This service is provided by completely abstracting the complexity of deploying simulations on supercomputers, thus offering transparent usage of highly optimized computational resources to non-IT experts.
Ryax contributions to CYBELE: data analytics workflow orchestration converging Cloud, Big Data, AI and HPC technologies
One of the principal technical goals of the CYBELE project is to combine Cloud, Big Data and HPC for the development and deployment of data analytics, while going beyond the current state of the art concerning the convergence of these technologies. The abstraction brought by the Cloud, the elasticity of Big Data and the performance achieved by HPC will, combined, play a crucial role in the project and promote their utilization in newly digitized contexts such as agriculture and farming.
HPC for performance
High Performance Computing is characterized by the use of specialized high-performance hardware, systems software and programming models, enabling bare-metal execution for optimal usage of compute resources. These infrastructures pay particular attention to security, isolation and efficiency, but impose steep learning curves on users wishing to exploit their capabilities. The most prominent programming model in HPC is MPI (Message Passing Interface), a parallel and distributed programming standard created to facilitate the transport of data between the distributed processing elements of supercomputers. However, its excellent performance is counterbalanced by poor elasticity, limited fault tolerance and high development complexity.
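As an illustration of what this learning curve looks like in practice, here is a minimal sketch of a Slurm batch script for an MPI job; the job name, partition, resource sizes and the simulation binary are placeholder assumptions, not part of the CYBELE platform:

```shell
#!/bin/bash
# Hypothetical Slurm batch script for an MPI job. All names and
# sizes below are illustrative placeholders.
#SBATCH --job-name=crop-simulation
#SBATCH --nodes=4                 # request 4 compute nodes
#SBATCH --ntasks-per-node=32      # 32 MPI ranks per node
#SBATCH --time=02:00:00           # 2-hour wall-clock limit
#SBATCH --partition=compute

# srun is Slurm's parallel launcher: it starts the 128 MPI ranks
# across the allocated nodes.
srun ./crop_model --input fields.dat
```

Abstracting exactly this kind of boilerplate away from non-IT experts is one of the goals CYBELE pursues.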
Inspired by the Cloud, HPC is starting to evolve. The ability to deploy fully personalized execution environments has long been a wish of HPC users, and containerization seemed a natural evolution. However, Docker, the de facto standard in the Cloud, is not acceptable on HPC systems because of various issues in security, isolation and performance. Hence, other containerization systems specialized for HPC have been developed, such as Singularity, Shifter and udocker, and their adoption in HPC is currently ongoing. Adopting containerization in HPC is a first important step towards simplification of usage and convergence with Cloud and Big Data.
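For illustration, a minimal Singularity definition file could look like the following sketch; the base image and installed package are placeholder assumptions:

```
Bootstrap: docker
From: python:3.8-slim

%post
    # Commands run once at build time, inside the image
    pip install numpy

%runscript
    # Command executed by "singularity run <image>"
    exec python3 "$@"
```

The image would then typically be built with `singularity build analysis.sif analysis.def` and executed on a compute node, without root privileges, with `singularity exec analysis.sif python3 model.py`.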
Big Data for elasticity
Big Data is often characterized by the 3Vs: the massive Volume of data to be processed, the wide Variety of data types (structured and unstructured) and the Velocity at which the data must be analyzed. Big Data technologies were initially pioneered by online enterprises such as Google, Yahoo and Facebook, but are now widely used in all areas involving data analytics. Hadoop MapReduce and Spark are the most prominent programming models for data-intensive computing. In particular, Spark has the advantage of being elastic and fault-tolerant, and its high level of abstraction makes it a much simpler parallel and distributed framework than MPI.
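The MapReduce model that Hadoop and Spark industrialize can be sketched in plain Python. The toy word count below is purely illustrative: it mirrors the three phases (map, shuffle, reduce) that the real frameworks distribute and parallelize across a cluster:

```python
# Illustrative single-machine sketch of the MapReduce model; real
# deployments would run these phases distributed via Hadoop or Spark.
from collections import defaultdict

def map_phase(records):
    # Mapper: emit a (word, 1) pair for every word in every record
    return [(word, 1) for record in records for word in record.split()]

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: aggregate the grouped values per key
    return {key: sum(values) for key, values in groups.items()}

records = ["precision agriculture", "precision livestock farming"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
# counts == {'precision': 2, 'agriculture': 1, 'livestock': 1, 'farming': 1}
```

Spark's resilient distributed datasets expose essentially these operations (`map`, `groupByKey`, `reduceByKey`) while handling partitioning and fault tolerance for the user.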
Artificial Intelligence techniques such as Machine Learning and Deep Learning, often developed using frameworks such as PyTorch and TensorFlow, are commonly considered a subset of Big Data. A typical use case deployed on hybrid infrastructures is to train Deep Learning models on specialized HPC hardware such as GPUs and perform inference on commodity computing resources. Similarly, use cases may involve HPC accelerators for both training and inference to achieve efficiency and high reactivity. Various ongoing open-source efforts enable the usage of Big Data frameworks on HPC systems: Horovod, for example, uses MPI under the hood to optimize distributed learning with TensorFlow. Other efforts involve RDMA-optimized Big Data stacks exploiting the InfiniBand high-performance networks of HPC infrastructures, or GPU-enabled runtimes through open-source libraries like RAPIDS, which can utilize GPUs with fine-grained optimizations.
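As a hedged sketch of what such a hybrid training launch looks like, a Horovod run over MPI could be invoked as follows; the host names, slot counts and the `train.py` script are placeholder assumptions:

```shell
# Launch 4 training processes across 2 nodes with 2 GPU slots each;
# Horovod uses MPI collectives underneath to average gradients
# between the ranks after each batch.
horovodrun -np 4 -H node1:2,node2:2 python train.py
```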
Cloud for abstraction
Cloud Computing gave us the ability to use computing resources on demand and simplified the usage of distributed computational resources through hardware virtualization, while allowing multi-tenancy and abstraction through containerization. Recent progress in Cloud technologies brought forward the advantages of disaggregated hardware resources (CPUs, memory, storage, etc.) combined with the emergence of serverless runtimes, for cost-efficiency and programming flexibility. Serverless cloud programming, composed of Cloud functions (FaaS), empowers users with a new general-purpose compute abstraction that saves development time and simplifies the design of Cloud applications, while offering better autoscaling, stronger isolation and lower costs.

Bringing the benefits of Cloud technologies to Big Data and HPC environments is an active research area. The convergence of Cloud with HPC and Big Data passes through the integration of their systems software, so that they collaborate more tightly and in optimal ways. In this regard, Kubernetes, the standard orchestration software in Clouds, with its mechanisms to abstract the complexity of managing resources and containerized applications, has recently started to orchestrate Big Data frameworks such as Spark and Artificial Intelligence libraries like TensorFlow. Kubernetes can play the role of a high-level orchestrator of both Big Data and HPC workloads. For this, open-source efforts such as the Singularity containerization solution, along with its Kubernetes and Slurm integrations, can be used as basic building blocks in the design of software architectures towards the convergence of Cloud, Big Data and HPC.
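To illustrate this kind of orchestration, a Kubernetes Job requesting a GPU for a containerized training step could be sketched as follows; the job name, image, script path and resource counts are placeholder assumptions:

```yaml
# Hypothetical Kubernetes Job for a containerized training step.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-crop-model
spec:
  template:
    spec:
      containers:
      - name: trainer
        image: tensorflow/tensorflow:latest-gpu   # placeholder image
        command: ["python", "/app/train.py"]      # placeholder script
        resources:
          limits:
            nvidia.com/gpu: 1                     # request one GPU
      restartPolicy: Never
```

Kubernetes schedules the container onto a node with a free GPU and restarts nothing on success, which is the desired behaviour for batch-style analytics workloads.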
The applications in CYBELE will be developed in the form of workflows, hence workflow management systems need to be adopted in order to automate the orchestration of workflows at a higher abstraction level and communicate with the Kubernetes orchestrator along with the different HPC and Big Data resource management systems. A plethora of workflow management systems exist, with different characteristics. Some are used in HPC (Pegasus, Makeflow, etc.) whereas others are better adapted to Big Data environments (Airflow, Argo, etc.). However, the execution of workflows combining both HPC and Big Data workloads is not fully supported by these tools, and further research along with new solutions is needed.
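As a sketch of this workflow level, an Argo Workflow chaining a preprocessing step and a training step could look like the following; the workflow name, images and commands are placeholder assumptions:

```yaml
# Hypothetical two-step Argo Workflow: preprocess, then train.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: crop-yield-
spec:
  entrypoint: pipeline
  templates:
  - name: pipeline
    steps:
    - - name: preprocess          # step 1: data preparation
        template: preprocess
    - - name: train               # step 2: runs after step 1 completes
        template: train
  - name: preprocess
    container:
      image: example/preprocess:latest   # placeholder image
      command: [python, preprocess.py]
  - name: train
    container:
      image: example/train:latest        # placeholder image
      command: [python, train.py]
```

Tools like this cover the Cloud/Big Data side well; dispatching some of these steps to an HPC resource manager such as Slurm is precisely the part that remains an open problem.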
Ryax Technologies contributions
Ryax Tech contributes its expertise in the orchestration of hybrid workloads on specialized compute infrastructures and its 10+ years of experience in distributed computing and HPC resource management. Ryax Tech builds upon the development of its data engineering software platform, Ryax, dedicated to data analytics workflow management on hybrid infrastructures, and uses the open-source building blocks of Ryax as basic bricks for the systems software architecture in CYBELE. Furthermore, research-oriented work on HPC and Big Data collocation will feed the design of new R&D solutions supporting the co-scheduling of HPC and Big Data workloads, to be integrated within both the Ryax product and the CYBELE systems software platform.
In more detail, Ryax Tech is actively involved in the technical workpackages of CYBELE and organizes the efforts around HPC, Big Data and Cloud convergence. Ryax Tech is leading Workpackage 2, dedicated to the infrastructure implementation, which involves the design of the CYBELE systems software that will control the computational resources. The architecture spans from the workflow management systems, responsible for combining HPC and Big Data workloads with resource abstractions and Cloud/Serverless-like functionalities; to the dedicated HPC and Big Data resource management systems, for optimal performance; through the environment deployment tools, which bring simplicity of usage, and the programming models and runtimes, along with their HPC-optimized versions, used as building blocks for users to develop their hybrid applications; down to the infrastructure access security, which will guarantee safe and confidential policies and techniques for accessing computational resources.
The design of the CYBELE architecture, along with the various deliverables and project milestones, will be available online on the CYBELE project website. Upcoming blog posts will describe in more detail the different components of Workpackage 2, explain how the design of the Ryax product helps the evolution of the CYBELE infrastructure implementation and vice versa, and show how the open-source building blocks are used and adapted to fit both cases.
Finally, an important goal for Ryax Tech is to enable a tight integration between the Ryax and CYBELE platforms: as the Ryax product and the CYBELE project evolve, Ryax clients from the agriculture industry will have the opportunity to benefit from the CYBELE platform and the power of deploying workflows on HPC centers, effortlessly. In addition, we will showcase the advantages, for CYBELE users, of onboarding onto the Ryax data engineering platform to address their Precision Agriculture & farming data analytics use cases, offering execution of their workflows from the HPC center down to more constrained Fog and Edge resources, seamlessly, securely and efficiently, along with an easy-to-use web interface to express workflows using high-level abstractions.
To learn more about this project, have a look at the CYBELE website here.
The Ryax Team.