April 2019

Volume 34 Number 4

[Azure Confidential Computing]

Secure Multi-Party Machine Learning with Azure Confidential Computing

By Stefano Tempesta

Security is paramount when implementing business solutions hosted in a public cloud, especially when sensitive data and intellectual property are stored as part of the solution itself. The industry already has best practices for securing data at rest and in transit, but you also need to protect your data from unauthorized access while it’s being used. Azure confidential computing (bit.ly/2BxQkpp) provides a new level of data protection and encryption while your data is being processed, using Trusted Execution Environments (TEEs). TEEs are implemented at the hardware or software (hypervisor) level and provide a protected processor and memory space where your code and data can run in complete isolation from outside applications and systems. TEEs ensure there’s no way to view data or the operations inside from the outside, even with a debugger. They also ensure that only authorized code is permitted to access data. If the code is altered or tampered with, the operations are denied and the environment is disabled. The TEE enforces these protections throughout the execution of code within it.

SQL Server Always Encrypted with Secure Enclaves

In my previous article, “Protect Your Data with Azure Confidential Computing” (msdn.com/magazine/mt833273), I introduced Azure confidential computing and the Open Enclave SDK for building software applications that run protected code and data in a trusted execution environment. The same technology is in use for Azure SQL Database and SQL Server. This is an enhancement to the Always Encrypted capability, which ensures that sensitive data within a SQL database can be encrypted at all times without compromising the functionality of SQL queries. Always Encrypted with Secure Enclaves achieves this by delegating computations on sensitive data to an enclave, where the data is safely decrypted and processed. Available with SQL Server 2019, the enclave technology adopted, called Virtualization Based Security (VBS), isolates a region of memory within the address space of a user-mode process that’s completely invisible to all other processes and to the Windows OS on the machine. Even machine administrators aren’t able to see the memory of the enclave. Figure 1 shows what an admin would see when browsing the enclave memory with a debugger (note the question marks, as opposed to the actual memory content).

Figure 1 Browsing a Virtualization Based Security Enclave with WinDbg

The way the enhanced Always Encrypted feature uses enclaves is illustrated in Figure 2. The SQL Server instance contains an enclave, loaded with the code for performing data encryption, as well as the code implementing SQL operations. Before submitting a query to the SQL Server engine for processing, the SQL driver used by the client application sends the encryption keys to the enclave over a secure channel. When processing queries, the SQL Server engine runs these operations directly within the enclave, where the data is safely decrypted and processed. Data never leaves the enclave unencrypted.

Figure 2 Communication to SQL Server Is Always Encrypted with Secure Enclaves

Not all data in a database table requires encryption. Specific columns can be identified for securing data in an enclave. When parsing a SQL query, the SQL Server engine determines if the query contains any operations on encrypted data that require the use of the secure enclave. For queries where the secure enclave needs to be accessed, the client driver sends the encryption keys for the specific table columns required for the operations to the secure enclave, and then it submits the query for execution along with the encrypted query parameters. Data contained in the identified secured columns is never exposed in the clear outside of the enclave. SQL Server decrypts data contained in these columns only within the secure enclave. If the query contains parameters on the secured columns, these parameters are also decrypted within the enclave before the query runs.

Columns that contain sensitive data to encrypt can be identified with an ALTER COLUMN statement. This statement is executed within the enclave. There’s no need to move data out of the database for initial encryption or for other encryption-related schema changes. This improves the performance and reliability of such operations greatly, and it doesn’t require special client-side tools. Assuming you want to protect the age of patients, contained in the Age column of the MedicalRecords database table, the following statement encrypts data in that column with the AES-256 encryption algorithm:

ALTER TABLE [dbo].[MedicalRecords]
ALTER COLUMN [Age] [char](11)
ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = [CEK1],
  ENCRYPTION_TYPE = Randomized,
  ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256') NOT NULL
GO
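
With the column encrypted, a client application can still query it through the .NET SQL client driver. The following is a minimal, hypothetical sketch (the connection string, database name and parameter value are illustrative assumptions; the Column Encryption Setting and Enclave Attestation URL keywords are explained in the Enclave Attestation section of this article). The parameter is passed as a SqlParameter whose type matches the column definition, so the driver can encrypt it transparently before it leaves the client, and the comparison against the encrypted column is evaluated inside the secure enclave:

using System;
using System.Data;
using System.Data.SqlClient;

public class EncryptedQuerySample
{
  // Hypothetical connection string; the last two keywords are covered
  // in the Enclave Attestation section of this article
  private const string ConnectionString =
    "Data Source=.; Initial Catalog=[Database Name]; Integrated Security=true; " +
    "Column Encryption Setting=Enabled; " +
    "Enclave Attestation URL=https://[Your-HGS-Server]/Attestation";

  public static int CountPatientsByAge(string age)
  {
    using (var connection = new SqlConnection(ConnectionString))
    {
      connection.Open();
      using (var cmd = connection.CreateCommand())
      {
        // The predicate on the encrypted [Age] column is computed in the
        // secure enclave; the parameter is encrypted by the client driver
        cmd.CommandText =
          "SELECT COUNT(*) FROM [dbo].[MedicalRecords] WHERE [Age] = @Age";
        cmd.Parameters.Add(
          new SqlParameter("@Age", SqlDbType.Char, 11) { Value = age });
        return (int)cmd.ExecuteScalar();
      }
    }
  }
}

Because the Age column uses randomized encryption, even this equality comparison requires the enclave; without secure enclaves, randomized encryption wouldn’t support any server-side computations at all.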

You can find comprehensive documentation on the new capabilities and on how to get started with Always Encrypted with Secure Enclaves at aka.ms/AlwaysEncryptedwithSecureEnclaves.

Multi-Party Sensitive Data Sources

In addition to SQL Server, Azure confidential computing has potentially broad application across many industries, including finance, retail, government and health care. Even competing organizations would benefit from pooling their private datasets and training machine learning models on the aggregate data to achieve higher accuracy of prediction. For example, health institutes could collaborate by sharing their private patient data, like genomic sequences and medical records, to gain deeper insights from machine learning across multiple datasets, without risk of the data being leaked to other organizations. Having more data to train on allows machine learning algorithms to produce better models. Confidential computing makes this kind of privacy-preserving collaboration possible: multiple parties leverage a common machine learning service that executes on their aggregate data. Although the health institutes in this use case wouldn’t want their own datasets to be shared with any other party, they can each upload their encrypted data to an enclave, perform remote attestation, run the machine learning code and, finally, download the encrypted machine learning model. The model may also be kept within the enclave for secure evaluation by all of the parties, subject to their agreed access control policies.

Figure 3 provides an overview of such a system. Multiple hospitals encrypt patient datasets, each with a different key. The hospitals deploy an agreed-upon machine learning algorithm in an enclave in a cloud datacenter and share their data keys with the enclave. The enclave processes the aggregate datasets and outputs an encrypted machine learning model.

Figure 3 Secure Multi-Party Machine Learning

In this scenario, sensitive data is stored in a SQL Server database with Always Encrypted with Secure Enclaves, and shared with a machine learning service that runs in a TEE. Organizations use the Open Enclave SDK (openenclave.io) to build portable C or C++ applications against different enclave types. With the SDK, they also take advantage of enclave creation and management, system primitives, runtime support, and cryptographic library support through a consistent API and an enclaving abstraction. The machine learning application benefits from combining multiple data sources to produce better training models. Because the data is protected by SQL Server, it’s the responsibility of the SQL engine to authenticate the machine learning enclave before sharing any data with it. Once this attestation is made, data is decrypted and visible solely to the machine learning code, without being revealed to other applications, even those running on the same virtual machine or app service, or to the hosting cloud platform.

Enclave Attestation

The secure enclave inside the SQL Server engine can access sensitive data stored in encrypted database columns and the corresponding column encryption keys. But before an application can send SQL Server a query that involves enclave computations, it should verify that the enclave is a genuine enclave based on a given technology (VBS, for example) and that the code running inside the enclave has been properly signed. The process of verifying the enclave is called enclave attestation and, in the case of SQL Server, it involves the SQL client driver used by the client application contacting an external attestation service. The specifics of the attestation process depend on the enclave technology and the attestation service. The attestation process SQL Server 2019 supports for VBS secure enclaves is called Windows Defender System Guard runtime attestation, which uses Host Guardian Service (HGS) as the attestation service. You need to configure HGS in your environment and register the machine hosting your SQL Server instance with HGS. You also have to configure your client applications with the HGS attestation URL.

To establish trust among parties, VBS enclaves expose an enclave attestation report that’s fully signed by a VBS-unique key. An attestation service uses this report to establish a trust relationship between a client application and an enclave. Essentially, attestation involves the enclave report, which is proof of the identity and integrity of the TEE application, and proof of platform integrity (which includes measurements of the host and the TEE). The VBS-unique key links the two, because it’s used to sign the report and is stored in the measurements. HGS validates the platform measurements and issues a health certificate with the key in it. The client then validates the entire report using the VBS key, validates the VBS key itself using the health certificate and, eventually, validates that the health certificate was issued by the HGS it trusts. An enclave gets the keys to encrypt and decrypt data via a secure tunnel established between the client driver and the enclave itself. A session key is generated for communication over this channel. The client driver then encrypts the column encryption keys with the session key, and signs and submits SQL queries that require enclave computations. Figure 4 extends the communication protocol of SQL Server Always Encrypted with Secure Enclaves to the use of an attestation service.

Figure 4 Enclave Attestation for a SQL Server Database

To use Always Encrypted with secure enclaves, an application must use a client driver that supports the feature. In SQL Server 2019, your applications should use .NET Framework 4.6 or higher and the .NET Framework Data Provider for SQL Server. In addition, .NET applications must be configured with a secure enclave provider specific to the enclave type (for example, VBS) and the attestation service (for example, HGS) you’re using. The supported enclave providers are shipped separately in a NuGet package, which you need to integrate with your application. An enclave provider implements the client-side logic for the attestation protocol and for establishing a secure channel with a secure enclave of a given type. Always Encrypted isn’t currently supported in .NET Core.

You can find step-by-step instructions for configuring SQL Server Always Encrypted with secure enclaves using SQL Server Management Studio (SSMS) in the tutorial at bit.ly/2Xig9nr. The tutorial walks through installing HGS on a computer to support host key attestation, and then configuring SQL Server as a guarded host registered with HGS using host key attestation. Please note that the purpose of the tutorial is to create a simple environment for testing purposes only. Host key attestation is recommended for test environments; you should use Trusted Platform Module (TPM) attestation for production environments. TPM is a technology designed to provide hardware-based security functions. A TPM chip is a secure crypto-processor designed specifically to carry out cryptographic operations, and it includes multiple physical security mechanisms to make it tamper-resistant to malicious software. If you’re interested in knowing more about TPM and how to initialize HGS using TPM-trusted attestation, you can start with bit.ly/2T3lVuN.

To connect to an instance of SQL Server Always Encrypted, the connection string of your client application should include two additional parameters:

  • Column Encryption Setting = Enabled
  • Enclave Attestation URL=https://<Your-HGS-Server>/Attestation

For example, the connection string for a .NET application would be similar to the following setting:

<connectionStrings>
  <add name="DataConnection"
    connectionString="Data Source=.; Initial Catalog=[Database Name];
      Integrated Security=true; Column Encryption Setting=Enabled;
      Enclave Attestation URL=https://[Your-HGS-Server]/Attestation" />
</connectionStrings>

Prediction in Machine Learning

The machine learning application I’m presenting in this article is built with ML.NET (dot.net/ml), an open source and cross-platform machine learning framework for .NET. Currently still in development, ML.NET enables machine learning tasks like classification and regression. The framework also exposes APIs for training models, as well as core components such as data structures, transformations and learning algorithms.

For the health care use case, the machine learning application implements a multi-class classification task that classifies patients into groups according to the likelihood they’ll develop a specific disease (to keep things less morbid here, I’ll refer to a generic disease rather than naming one specifically). As anticipated in the description of this multi-party data source scenario, health institutes share their encrypted datasets, containing all the necessary features for training a model.

The machine learning code is a Web API developed in C#. In your Visual Studio solution, add a reference to the Microsoft.ML NuGet package. This package contains the ML.NET library and the relevant namespaces to add to the application:

using Microsoft.ML;
using Microsoft.ML.Core.Data;
using Microsoft.ML.Data;

The Web API solution implements only the POST method, which accepts a JSON payload that describes a medical record. The MedicalRecord class in the solution, shown in Figure 5, identifies all the features analyzed by the ML.NET engine. Risk is the label of the dataset; it’s the value the model is trained to predict.

Figure 5 The MedicalRecord Class

public class MedicalRecord
{
  public float Risk { get; set; }
  public int Age { get; set; }
  public int Sex { get; set; }                // 0 = Female, 1 = Male
  public bool Smoker { get; set; }
  public ChestPainType ChestPain { get; set; }
  public int BloodPressure { get; set; }      // In mm Hg on admission to
                                              // the hospital
  public int SerumCholestoral { get; set; }   // In mg/dl
  public bool FastingBloodSugar { get; set; } // True if > 120 mg/dl
  public int MaxHeartRate { get; set; }       // Maximum heart rate achieved
}
public enum ChestPainType
{
  TypicalAngina = 1,
  AtypicalAngina,
  NonAnginal,
  Asymptomatic
}

The JSON message request assumes the following format:

{
  "risk": 0.0,
  "age": 0,
  "sex": 0,
  "smoker": false,
  "chestPain": 0,
  "bloodPressure": 0,
  "serumCholestoral": 0,
  "fastingBloodSugar": false,
  "maxHeartRate": 0
}

The Classification class implements the multi-class classification task that, given a medical record as input, assigns the patient to a risk class for contracting the disease. This process involves the following four steps:

  1. Instantiate an MLContext object that represents the ML.NET context class.
  2. Train the model based on historical datasets available in SQL Server (and protected within an enclave).
  3. Evaluate the model.
  4. Score it.

The output is returned to the client application (the health institute) in the form of a JSON response:

{
  "score": 0.0,
  "accuracy": 0.0
}

Let’s walk through each step in the machine learning code in order. The first step is to create an instance of MLContext as part of the Classification class, with a random seed (seed: 0) for repeatable and deterministic results across multiple trainings. MLContext is your reference object for access to source data and trained models, as well as the relevant machine learning transformers and estimators:

public Classification()
{
  _context = new MLContext(seed: 0);
}
private MLContext _context;

The next step in the machine learning process is training the model, as shown in Figure 6. This is the step that loads data from the SQL Server database protected by the secure enclave. A connection to SQL Server is created using the SqlConnection class, which internally handles the attestation and the communication with the enclave over a secure channel.

Figure 6 Training the Model

public IEnumerable<MedicalRecord> ReadData(int split = 100)
{
  // Use a List<T> so records can be added as they're read;
  // IEnumerable<T>.Append would return a new sequence without
  // modifying the underlying collection
  var medicalRecords = new List<MedicalRecord>();
  using (var connection = new SqlConnection(ConnectionString))
  {
    connection.Open();
    using (SqlCommand cmd = connection.CreateCommand())
    {
      cmd.CommandText = SqlQuery(split);
      using (SqlDataReader reader = cmd.ExecuteReader())
      {
        while (reader.Read())
        {
          medicalRecords.Add(ReadMedicalRecord(reader));
        }
      }
    }
  }
  return medicalRecords;
}

Note the split parameter, which selects a percentage of records from the database using the TABLESAMPLE functionality in SQL Server. Training and evaluation of a model require different subsets of data as input. A good rule of thumb is to allocate about 80 percent of the data for training, depending on its quality. The higher the quality of the input, the better the accuracy of prediction:

// The Risk label column is included so the model can be trained and evaluated
private static string SqlQuery(int split) =>
  $@"SELECT [Risk], [Age], [Sex], [Smoker], [ChestPain],
    [BloodPressure], [SerumCholestoral], [FastingBloodSugar],
    [MaxHeartRate] FROM [dbo].[MedicalRecords] TABLESAMPLE({split} PERCENT)";

Data is then loaded into the current ML context, and a model is trained using a linear multi-class classifier based on the Stochastic Dual Coordinate Ascent (SDCA) algorithm, an optimization technique for solving large-scale supervised learning problems. The multi-class classification trainer is added to a pipeline of trainers before the actual “fit”; that is, before the training of the model happens. ML.NET transforms the data through a chain of processes that apply custom transformations before training or testing.

The purpose of the transformations in the TrainModel method, shown in Figure 7, is to put the source data into the format the machine learning algorithm recognizes. To accomplish this featurization of the data, the transformation pipeline combines all of the feature columns into the Features column using the Concatenate method. But before that, I need to convert any non-float data to float data types, as all ML.NET learners expect features to be float vectors.

Figure 7 Obtaining a Training Model

public void TrainModel(IDataSource<MedicalRecord> dataSource, Stream targetStream)
{
  IEnumerable<MedicalRecord> sourceData = dataSource.ReadData(80); 
    // 80% of records for training
  IDataView trainData = _context.Data.ReadFromEnumerable<MedicalRecord>(sourceData);
  // Convert each categorical feature into one-hot encoding independently
  var transformers = _context.Transforms.Categorical.OneHotEncoding(
    "MedicalRecordAge", nameof(MedicalRecord.Age))
    .Append(_context.Transforms.Categorical.OneHotEncoding(
      "MedicalRecordSex", nameof(MedicalRecord.Sex)))
    .Append(_context.Transforms.Categorical.OneHotEncoding(
      "MedicalRecordSmoker", nameof(MedicalRecord.Smoker)))
    .Append(_context.Transforms.Categorical.OneHotEncoding(
      "MedicalRecordChestPain", nameof(MedicalRecord.ChestPain)))
    .Append(_context.Transforms.Categorical.OneHotEncoding(
      "MedicalRecordBloodPressure", nameof(MedicalRecord.BloodPressure)))
    .Append(_context.Transforms.Categorical.OneHotEncoding(
      "MedicalRecordSerumCholestoral", nameof(
        MedicalRecord.SerumCholestoral)))
    .Append(_context.Transforms.Categorical.OneHotEncoding(
      "MedicalRecordFastingBloodSugar", nameof(
        MedicalRecord.FastingBloodSugar)))
    .Append(_context.Transforms.Categorical.OneHotEncoding(
      "MedicalRecordMaxHeartRate", nameof(MedicalRecord.MaxHeartRate)));
  // Combine the one-hot encoded columns into the single Features vector
  // expected by the trainer, then cache the prepared data
  var processChain = transformers.Append(
    _context.Transforms.Concatenate(DefaultColumnNames.Features,
      "MedicalRecordAge",
      "MedicalRecordSex",
      "MedicalRecordSmoker",
      "MedicalRecordChestPain",
      "MedicalRecordBloodPressure",
      "MedicalRecordSerumCholestoral",
      "MedicalRecordFastingBloodSugar",
      "MedicalRecordMaxHeartRate"))
    .AppendCacheCheckpoint(_context);
  var trainer =
    _context.MulticlassClassification.Trainers.StochasticDualCoordinateAscent(
      labelColumn: DefaultColumnNames.Label,
        featureColumn: DefaultColumnNames.Features);
  var pipeline = processChain.Append(trainer);
  var model = pipeline.Fit(trainData);
  _context.Model.Save(model, targetStream);
}

All features in the medical record are essentially categories, not float values. So, before concatenating them into a single output column, I apply a categorical transformation to each one of them using the OneHotEncoding function. By default, a learning algorithm processes only features from the Features column. The AppendCacheCheckpoint method at the end caches the prepared data, so iterating over it multiple times during training can be faster. The actual training happens when you invoke the Fit method on the estimator pipeline. Estimators learn from data and produce a machine learning model. The Fit method returns a model to use for predictions. Trained models can eventually be persisted to a stream for future use.

After training, model evaluation is essential for ensuring the quality of the next step, model scoring, which produces the outcome prediction. Evaluation is performed against predictions obtained by transforming test data with the trained model, using a percentage of the source data (in this case 20 percent), as shown in Figure 8. The Evaluate method of the MulticlassClassification object defined within the current ML.NET context computes the quality metrics for the model using the specified dataset. It returns a MultiClassClassifierMetrics object that contains the overall metrics computed by multi-class classification evaluators. In my example, I calculate the accuracy as a percentage of correct predictions out of the total number of records evaluated. The TopKAccuracy metric describes the relative number of examples where the true label is one of the top K predicted labels.

Figure 8 Evaluating the Model

public double EvaluateModel(IDataSource<MedicalRecord> dataSource,
  Stream modelStream)
{
  IEnumerable<MedicalRecord> sourceData = dataSource.ReadData(20); 
    // 20% of records for evaluation
  var testData =
    _context.Data.ReadFromEnumerable<MedicalRecord>(sourceData);
  var predictions = _context.Model.Load(modelStream).Transform(testData);
  var metrics = _context.MulticlassClassification.Evaluate(
    data: predictions, label: DefaultColumnNames.Label,
      score: DefaultColumnNames.Score,
        predictedLabel: DefaultColumnNames.PredictedLabel, topK: 0);
  return metrics.TopKAccuracy / metrics.TopK * 100.0 / sourceData.Count();
}

The last step in the classification API is to score the model and generate a prediction that identifies a risk for a person with the given medical record. Scoring starts with loading the previously trained and saved model, and then creating a prediction engine from it. In the Score method, I load the trained model and then invoke the CreatePredictionEngine function with the medical record and risk prediction types (the MedicalRecord and RiskPrediction C# classes, respectively). The prediction engine generates a score that identifies the class of risk for a patient with the analyzed medical record:

public double Score(MedicalRecord medicalRecord, Stream modelStream)
{
  ITransformer trainedModel = _context.Model.Load(modelStream);
  var predictionEngine = trainedModel.CreatePredictionEngine<MedicalRecord,
    RiskPrediction>(_context);
  var prediction = predictionEngine.Predict(medicalRecord);
  return prediction.Score;
}

The RiskPrediction object contains the Score and Accuracy attributes returned by the Post method of the Web API. For the Score property, I added the ColumnName attribute to tell ML.NET that this is where I want to output the scored value produced by the Predict method of the prediction engine:

public class RiskPrediction
{
  [ColumnName("Score")]
  public double Score;
  public double Accuracy;
}

Finally, as shown in Figure 9, the Post method of the health care Web API processes all these steps in sequence:

  1. Obtains a MedicalRecord object from its representation in JSON.
  2. Instantiates the Classification helper, which uses ML.NET for multi-class classification.
  3. Trains and saves the model.
  4. Calculates its accuracy.
  5. Predicts the risk associated with the medical record provided in input.

Figure 9 Predicting the Risk of a Particular Medical Record

public IHttpActionResult Post(string medicalRecordJson)
{
  MedicalRecord medicalRecordObj =
    JsonConvert.DeserializeObject<MedicalRecord>(medicalRecordJson);
  Classification classification = new Classification();
  using (FileStream modelStream = new FileStream("[Path]", FileMode.Create))
  {
    classification.TrainModel(new MedicalRecordDataSource(), modelStream);
    // Rewind the stream so the trained model can be read back
    // for evaluation and scoring
    modelStream.Seek(0, SeekOrigin.Begin);
    double accuracy = classification.EvaluateModel(
      new MedicalRecordDataSource(), modelStream);
    modelStream.Seek(0, SeekOrigin.Begin);
    double score = classification.Score(medicalRecordObj, modelStream);
    RiskPrediction prediction = new RiskPrediction
    {
      Score = score,
      Accuracy = accuracy
    };
    return Json<RiskPrediction>(prediction);
  }
}

Note that this code leaves room for performance improvement, for example by caching trained models for reuse. Model training and evaluation can be skipped when working with a pre-trained model, yielding a significant performance boost.
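
As a minimal sketch of that optimization (the ModelCache class, lazy initialization and file path handling are my assumptions, not part of the article’s solution), the Web API could load the persisted model once and reuse the prediction engine across scoring requests. Note that a PredictionEngine instance isn’t thread-safe, so a production API would need to pool or synchronize access to it:

using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Core.Data;

public static class ModelCache
{
  // Load the previously trained model once, on first use, instead of
  // retraining and re-evaluating it on every request
  private static readonly Lazy<PredictionEngine<MedicalRecord, RiskPrediction>>
    _engine = new Lazy<PredictionEngine<MedicalRecord, RiskPrediction>>(() =>
    {
      var context = new MLContext(seed: 0);
      using (var modelStream = File.OpenRead("[Path]")) // Saved by TrainModel
      {
        ITransformer trainedModel = context.Model.Load(modelStream);
        return trainedModel.CreatePredictionEngine<MedicalRecord,
          RiskPrediction>(context);
      }
    });

  public static RiskPrediction Predict(MedicalRecord medicalRecord) =>
    _engine.Value.Predict(medicalRecord);
}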

In my tests, I ran the prediction process against individual datasets obtained from health institutes, achieving an average accuracy of 90 percent. Each dataset has roughly 100,000 records. When multiple datasets were combined into a single one, accuracy reached 96 percent.

Wrapping Up

In this article, I showed how a machine learning application can produce better predictions by leveraging a broader dataset collected from several trusted sources, without compromising the confidentiality of the information shared among the parties involved. This is achieved by protecting the data in use by the machine learning code within a SQL Server Always Encrypted secure enclave, part of the Azure confidential computing technology. The machine learning solution collects datasets from different protected sources and, by combining the data into a larger dataset, obtains higher prediction accuracy from the trained model.

It’s important to mention that confidential computing requires your code to run in a TEE. In the example presented in this article, the SQL client driver is the intermediary with the SQL Server trusted execution environment. The ML.NET code, therefore, may or may not run within a TEE. If it does, it’s called a secure multi-party machine learning computation. Multi-party computation implies that there are two or more mutually distrusting parties that have data and/or code they would like to protect from each other. In this context, the parties would like to perform a computation together to generate a shared output, without leaking any sensitive data. The enclave acts as a trusted intermediary for this computation.

The entire solution is available on GitHub at bit.ly/2Exp78V.


Stefano Tempesta  is a Microsoft Regional Director, MVP on AI and Business Applications, and member of Blockchain Council. A regular speaker at international IT conferences, including Microsoft Ignite and Tech Summit, Tempesta’s interests extend to blockchain and AI-related technologies. He created Blogchain Space (blogchain.space), a blog about blockchain technologies, writes for MSDN Magazine and MS Dynamics World, and publishes machine learning experiments on the Azure AI Gallery (gallery.azure.ai).

Thanks to the following Microsoft technical experts who reviewed this article: Gleb Krivosheev, Simon Leet
Simon Leet is the Principal Engineer in the Azure Confidential Computing product group.

