FAIR principles for AI models with a practical application for accelerated high energy diffraction microscopy

We present two main results:

  • FAIR and AI-ready experimental datasets. We published experimental datasets used to train, validate, and test our AI models via the MDF, and created Big Data Bags (BDBags)26 with associated Minimal Viable Identifiers (minids)26 to specify, describe, and reference each of these datasets. We also published Jupyter notebooks via MDF that describe these datasets and illustrate how to explore and use them to train, validate, and test AI models. We FAIRified these datasets following practical guidelines9,12,13.

  • Definition of FAIR for AI models and practical examples. We use the aforementioned FAIR and AI-ready datasets to produce three AI models: a baseline PyTorch model, an optimized AI model for accelerated inference that is constructed by porting the baseline PyTorch model into an NVIDIA TensorRT engine, and a model created on the SambaNova DataScale® system at the ALCF AI-Testbed. We use the ThetaGPU supercomputer at ALCF to create and containerize the first two models. We use these three models to showcase how our proposed definitions for FAIR AI models may be quantified by creating a framework that brings together DLHub, funcX, the MDF and disparate hardware architectures at ALCF.

FAIR and AI-ready datasets

We FAIRified a high energy microscopy dataset produced at the Advanced Photon Source at Argonne National Laboratory. We split this dataset into a training dataset that we used to create the AI models described below, and a validation dataset that we used to compute relevant metrics pertaining to the performance of our AI models for regression analyses. We published these datasets via MDF and provide the following information:

Findable

  • Unique Digital Object Identifier (DOI) for Training_Set27 and Validation_Set28

  • Rich and descriptive metadata

  • Detailed description of the datasets, including data type and shape

  • Metadata uses machine-readable keywords

  • Metadata contains resource identifier

  • (Meta)data are indexed in a searchable resource

Accessible

  • Datasets are published with CC-BY 4.0 licenses

  • Training_Set27 and Validation_Set28 are open datasets

  • (Meta)data are retrievable by their identifier using a standardized communications protocol

  • Metadata remains discoverable, even in the absence of the datasets

Interoperable

  • Datasets are published in open HDF5 format

  • Metadata contains qualified references to related data29 and publications25

  • Metadata follows standards for X-ray spectroscopy, uses controlled vocabularies and ontologies, and adopts a data model based on a physics classification scheme developed by the American Physical Society30. These datasets were collected and curated to streamline their use for AI analyses31

  • Uses FAIR vocabulary following the ten rules provided in ref. 32

Reusable

  • Author list, and points of contact including email addresses

  • Datasets include Jupyter notebooks to explore, understand and visualize the datasets

  • Datasets include Jupyter notebooks that show how to use the Training_Set27 to train AI models with the scripts we have released in GitHub33

  • Datasets include Jupyter notebooks that show how to use the Validation_Set28 to conduct AI inference using the computational framework we described above

  • Datasets include a description of how they were produced and provide this information in a machine-readable metadata format

We have also created BDBags and minids for each of these datasets to ensure that their creation, assembly, consumption, identification, and exchange can be easily integrated into a user's workflow. A BDBag is a mechanism for defining a dataset and its contents by enumerating its elements, regardless of their location. Each BDBag has a data/ directory containing (meta)data files, along with a checksum for each file. A minid for each of these BDBags provides a lightweight persistent identifier that unambiguously identifies the dataset, wherever it is stored. Computing a checksum of a BDBag's contents allows others to validate that they have the correct dataset and that no data have been lost. In short, these tools enable us to package and describe our datasets in a common manner, and to refer to them unambiguously, thereby enabling efficient management and exchange of data.

  1. BDBag for training set34 and its associated minid:olgmRyIu8Am7

  2. BDBag for validation set35 with its associated minid:16RmizZ1miAau

In addition to the FAIR properties listed above, we have ensured that our datasets meet the detailed FAIR guidelines described in ref. 9.
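The checksum-based validation that BDBags enable can be sketched as follows. This is a minimal re-implementation for illustration only: the bag layout mirrors the BagIt-style `data/` directory plus `manifest-sha256.txt` convention, but the file names and contents are hypothetical, and a production workflow would use the actual `bdbag` tooling.

```python
import hashlib
import os
import tempfile

def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(bag_dir):
    """Record a checksum for every file under data/, BagIt-style."""
    lines = []
    data_dir = os.path.join(bag_dir, "data")
    for name in sorted(os.listdir(data_dir)):
        rel = os.path.join("data", name)
        lines.append(f"{sha256_of(os.path.join(bag_dir, rel))}  {rel}")
    with open(os.path.join(bag_dir, "manifest-sha256.txt"), "w") as f:
        f.write("\n".join(lines))

def validate_bag(bag_dir):
    """Return True if every file matches its recorded checksum."""
    with open(os.path.join(bag_dir, "manifest-sha256.txt")) as f:
        for line in f:
            digest, rel = line.strip().split("  ", 1)
            if sha256_of(os.path.join(bag_dir, rel)) != digest:
                return False
    return True

# Hypothetical miniature bag with a single (fake) data file.
bag = tempfile.mkdtemp()
os.makedirs(os.path.join(bag, "data"))
with open(os.path.join(bag, "data", "scan_000.h5"), "wb") as f:
    f.write(b"example detector frame bytes")
write_manifest(bag)
print(validate_bag(bag))  # True: contents match the manifest
```

Because the manifest travels with the bag, any recipient can rerun the validation step and detect loss or corruption without contacting the publisher.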

FAIR AI models

The application of FAIR principles to the creation and sharing of AI models is an active area of research. Here we contribute to this effort by introducing a set of measurable FAIR principles for AI models, with an emphasis on practical applications. We propose that each of these principles be quantified as Pass or Fail. We constructed these definitions taking into account the diverse needs and levels of expertise of AI practitioners. In practice, we considered three types of AI models, which share the same architecture, as described in this GitHub repository33, and which are trained with the same Training_Set27. However, as shown in Fig. 1, these AI models have distinct features: (i) a baseline AI model trained using PyTorch on GPUs in the ThetaGPU supercomputer at ALCF; (ii) the fully trained AI model in (i) optimized with NVIDIA TensorRT, a software development kit (SDK) that enables low-latency, high-throughput AI inference, in the ThetaGPU supercomputer at ALCF; and (iii) the AI model in (i) trained using the SambaNova DataScale® system at the ALCF AI-Testbed. These cases exemplify different needs and levels of expertise in the development of AI tools. We show below that, irrespective of the skill set of AI practitioners, hardware used, and target application, all these fully trained AI models deliver consistent results. In what follows, an AI model refers to a fully trained AI model.

Findable

Proposition

An AI model is findable when a DOI may direct a human or machine to a digital resource that contains all the information required to uniquely define the AI model, i.e., descriptive and rich AI model metadata that provides the title of the model, authors, DOI, year of publication, a free-text description, information about the input and output data types and shapes, and the dependencies (TensorFlow, PyTorch, Conda, etc.) and their versions used to create and containerize the AI model. The published AI model should also include instructions to run it, a minimal test set to evaluate its performance, and the actual fully trained AI model in a user- and machine-friendly format, such as a Jupyter notebook or a container.

This work

We have published three AI models in DLHub, and assigned DOIs to each of them: (i) a traditional PyTorch model; (ii) an NVIDIA TensorRT version of the traditional PyTorch model; and (iii) a model trained on the SambaNova DataScale® system:

  1. PyTorch Model36

  2. TensorRT Model37

  3. SambaNova Model38

Each of these AI models includes rich and descriptive metadata following the DataCite metadata standard, and is available as JSON formatted responses through a REST API or Python SDK. Furthermore, DLHub provides an interface that enables users to seamlessly run AI models following step-by-step examples, and explore the AI model’s metadata in detail.
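The kind of rich, machine-readable record described above can be sketched as follows. The field names mirror DataCite metadata properties (`identifier`, `titles`, `creators`, `publicationYear`, `descriptions`), but the DOI, author, and description values are placeholders, not the published records, and `is_findable` is a hypothetical helper illustrating the Pass/Fail check, not part of DLHub.

```python
import json

# A minimal, hypothetical metadata record using DataCite-style fields.
# All values below are placeholders for illustration.
model_metadata = {
    "identifier": {"identifier": "10.xxxx/placeholder-doi", "identifierType": "DOI"},
    "titles": [{"title": "Baseline PyTorch model for HEDM analysis"}],
    "creators": [{"creatorName": "Example, Author"}],
    "publicationYear": "20XX",
    "descriptions": [{
        "description": "Input: 2D diffraction patches; output: regressed peak parameters.",
        "descriptionType": "Abstract",
    }],
}

def is_findable(record):
    """Pass/Fail check: are the fields our Findable proposition requires present?"""
    required = {"identifier", "titles", "creators", "publicationYear", "descriptions"}
    return required.issubset(record)

serialized = json.dumps(model_metadata, indent=2)  # JSON, as a REST API would serve it
print(is_findable(model_metadata))  # True
```

Because the record is plain JSON, the same check can be run by a human inspecting the repository page or by a machine harvesting metadata over the network.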

Accessible

Proposition

An AI model is accessible when it is discoverable by a human or machine, and it may be downloaded (to further develop it, retrain it, optimize it for accelerated inference, etc.) or directly invoked to conduct AI inference. The AI model’s metadata should also be discoverable even in the absence of the AI model.

This work

Our AI models are freely accessible through the DOIs provided by DLHub. The AI models may be freely downloaded or invoked over the network for AI inference. For the latter, we have deployed funcX endpoints on the ThetaGPU supercomputer, which may be used to invoke the AI models published in DLHub and to access the datasets published in the MDF. Users may interact with published AI models in DLHub by submitting HTTP requests to a REST API. The DLHub SDK contains a client that provides a Python API to these requests and hides the tedious operations involved in making an HTTP call from Python. The schema used by DLHub ensures that the AI model's metadata provides explicit information about the model's identifier and its type (in this case, a DOI), so the metadata remains discoverable even in the absence of the AI model itself.
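To make concrete what the SDK hides, the sketch below assembles the pieces of such an HTTP inference call without sending it. The base URL, endpoint path, payload schema, model name, and token are all assumptions for illustration; the actual request format is defined by the DLHub service documentation.

```python
import json

# Hypothetical service base URL and model name, for illustration only.
SERVICE = "https://example-dlhub-service.org/api/v1"

def build_run_request(model_name, inputs):
    """Assemble the parts of an HTTP inference call that an SDK client hides:
    the endpoint URL, the auth and content-type headers, and the JSON body."""
    return {
        "url": f"{SERVICE}/servables/{model_name}/run",
        "headers": {
            "Authorization": "Bearer <auth-token>",  # placeholder credential
            "Content-Type": "application/json",
        },
        "body": json.dumps({"data": inputs}),
    }

req = build_run_request("bragg-peak-model", [[0.1, 0.2], [0.3, 0.4]])
print(req["url"])
```

An SDK client reduces this boilerplate to a single call with the model identifier and input data, which is precisely the convenience the text describes.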

Interoperable

Proposition

An AI model is interoperable when it may be readily used by machines to conduct AI-driven inference across disparate hardware architectures. This property may be realized by containerizing the model and providing infrastructure to enable AI models to process data in disparate hardware architectures. Furthermore, the AI model’s metadata should use a formal, accessible, and broadly used format, such as JSON or HTML.

This work

We quantified this property by evaluating the performance of our AI models across disparate hardware architectures. For instance, we ran our three AI models using RDUs, CPUs, and GPUs available at ALCF. We also published these models in DLHub with metadata available as both JSON and HTML; the models' metadata in DLHub follows the vocabulary of the DataCite metadata standard.

Reusable

Proposition

An AI model is reusable when it may be used by humans or machines to reproduce its putative capabilities for AI-driven analyses, and when it contains quantifiable metrics that inform users whether it may be used to process datasets that differ from those originally used to create it. This property can be achieved by providing information about the AI model’s required input and output data types and shapes; examples that show how to invoke the model; and a control dataset and uncertainty quantification metrics that indicate the realm of usability of the model. These reusability metrics may also be used to identify when a model is no longer trustworthy, in which case active learning, transfer learning, or related methods may be needed to fine tune the model so it may provide trustworthy predictions. The AI model’s metadata should also provide detailed provenance about the AI model, i.e., authors, dependencies and their versions used to create and containerize the model, datasets used to train the model, year of publication, and a brief description of the model.

This work

Our models published in DLHub include examples that describe the input and output data types and shapes, and a sample dataset to quantify their performance. We have also provided domain-informed metrics to ascertain when the predictions of our AI models are trustworthy. The DLHub SDK provides step-by-step examples that show how to run the models and explore each model's metadata, which records its detailed provenance. In this study we use the L2 norm, or Euclidean distance, a well-known metric for quantifying the performance of AI models for regression.
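The L2-norm check described above can be sketched as follows. The threshold value is an illustrative assumption, not the domain-informed cutoff used in this work, and `is_trustworthy` is a hypothetical helper name.

```python
import math

def l2_distance(predicted, reference):
    """Euclidean (L2) distance between a prediction and its reference values."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)))

def is_trustworthy(predicted, reference, threshold=0.05):
    """Flag a prediction as trustworthy when its L2 error is under a cutoff.

    The default threshold is a placeholder; in practice it should come
    from domain-informed validation against a control dataset.
    """
    return l2_distance(predicted, reference) < threshold

pred, ref = [0.51, 0.29], [0.50, 0.30]
print(round(l2_distance(pred, ref), 4))  # 0.0141
print(is_trustworthy(pred, ref))         # True
```

A check like this is what lets a user decide whether a model still applies to a new dataset, or whether transfer learning or retraining is needed first.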

Expected outcome of proposed FAIR principles for AI models

Just as the FAIR principles for scientific data aim to automate data management, the FAIR principles for AI models proposed above aim to maximize the impact of AI tools and methodologies, and to facilitate their adoption and further development, with a view to eventually realizing autonomous AI-driven scientific discovery. To realize that goal, it is essential to link FAIR and AI-ready datasets with FAIR AI models within a flexible and smart computing fabric that streamlines the creation and use of AI models in scientific discovery. The domain-agnostic computing framework described above represents a step in that direction, since it brings together scientific data infrastructure (DLHub, funcX, and Globus) and modern computing environments (ALCF and the ALCF AI-Testbed), and leverages FAIR and AI-ready datasets (MDF).

This ready-to-use framework addresses common challenges in the adoption and use of AI methodologies: it provides examples that illustrate how to use AI models and their expected input and output data, and it enables users to readily use, download, or further develop AI models. We created this framework because, as AI practitioners, we are acutely aware that a common roadblock to using existing AI models is deploying a model on a computing platform only to find that, for example, library incompatibilities slow progress, lead to duplicated effort, or discourage researchers from investing limited time and resources in the adoption of AI methodologies. Our framework addresses these limitations and demonstrates how to combine computing platforms, container technologies, and tools to accelerate AI inference. It also highlights the importance of including uncertainty quantification metrics in AI models to inform users of the reliability and realm of applicability of AI predictions.
