Summarize Data

 

Published: August 13, 2015

Updated: March 2, 2017

Generates a basic descriptive statistics report for the columns in a dataset.

Category: Statistical Functions

You can use the Summarize Data module to create a set of standard statistical measures that describe each column in the input table. The module does not return the original dataset. Instead, it generates a row for each column, beginning with the column name and followed by relevant statistics for that column, based on its data type.

Such reports are useful when you want to understand the characteristics of the complete dataset. For example, you might need to know:

  • How many missing values are there in each column?

  • How many unique categorical values are there in a feature column?

  • What are the mean and standard deviation of the column?

You can get a partial list of statistics by using the Visualize option in any module that outputs a dataset, but the visualization covers only a limited number of rows from the top of the dataset. By outputting the statistics as a tabular dataset, you can use the data in BI reporting tools or provide the values as input to another custom operation in the experiment.
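To make the shape of such a report concrete, here is a minimal sketch in plain Python that builds one summary row per column. It is not the module's implementation: the `summarize` function, the dictionary-of-lists table layout, and the choice to exclude missing values from the unique count are all assumptions made for illustration.

```python
import statistics

def summarize(table):
    """Return one summary row (a dict) per column of `table`.

    `table` is a hypothetical layout: it maps column names to lists of
    values, and None marks a missing value.
    """
    report = []
    for name, values in table.items():
        present = [v for v in values if v is not None]
        # Treat a column as numeric only if every present value is a number.
        numeric = bool(present) and all(
            isinstance(v, (int, float)) for v in present
        )
        report.append({
            "Feature": name,
            "Count": len(values),
            # Assumption: missing values are not counted as a unique value.
            "Unique Value Count": len(set(present)),
            "Missing Value Count": len(values) - len(present),
            "Mean": statistics.mean(present) if numeric else None,
            "Sample Standard Deviation": (
                statistics.stdev(present)
                if numeric and len(present) > 1 else None
            ),
        })
    return report

data = {
    "age": [34, 41, None, 29],
    "city": ["Oslo", "Lima", "Oslo", None],
}
report = summarize(data)
for row in report:
    print(row)
```

Because the result is itself a table (one row per input column), it can be written out or passed to a later step, which is the advantage over a visualization described above.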

  1. Add the Summarize Data module to your experiment. You can find this module in the Statistical Functions group in the experiment items list in Azure Machine Learning Studio.

  2. Connect the dataset for which you want to generate a report.

    If you want to report on only some columns, use the Select Columns in Dataset module to project a subset of columns to work with.

  3. No additional parameters are required. By default, the module analyzes all columns that are provided as input, and depending on the type of values in the columns, outputs a relevant set of statistics as described in the Results section.

  4. Run the experiment, or right-click the module and select Run selected.

Results

The report from the module can include the following statistics. Some statistics might not be computed, depending on the column data type. See the Technical Notes section for details.

Column name                 Description
Feature                     Name of the column
Count                       Count of all rows
Unique Value Count          Number of unique values in column
Missing Value Count         Number of missing values in column
Min                         Lowest value in column
Max                         Highest value in column
Mean                        Mean of all column values
Mean Deviation              Mean deviation of column values
1st Quartile                Value at first quartile
Median                      Median column value
3rd Quartile                Value at third quartile
Mode                        Mode of column values
Range                       Integer representing the number of values between the maximum and minimum values
Sample Variance             Variance for column; see Note
Sample Standard Deviation   Standard deviation for column; see Note
Sample Skewness             Skewness for column; see Note
Sample Kurtosis             Kurtosis for column; see Note
P0.5                        0.5% percentile
P1                          1% percentile
P5                          5% percentile
P95                         95% percentile
P99.5                       99.5% percentile
Note

The module calculates the statistics on the assumption that the instances in the dataset are a representative sample of a population. If you need statistics calculated for the population, use the options in the Compute Elementary Statistics module, which can compute either sample or population statistics.
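The difference between sample and population statistics comes down to the divisor used for the variance: n - 1 for a sample, n for a population. The sketch below uses Python's standard `statistics` module on made-up values to show the effect; it illustrates the general definitions and is not taken from either module.

```python
import statistics

# Made-up values for illustration; the mean is 5 and the squared
# deviations from the mean sum to 32.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Sample statistics divide by n - 1 (here 7).
sample_variance = statistics.variance(values)
sample_std = statistics.stdev(values)

# Population statistics divide by n (here 8).
population_variance = statistics.pvariance(values)
population_std = statistics.pstdev(values)

print("sample:    ", sample_variance, sample_std)
print("population:", population_variance, population_std)
```

The sample figures are always at least as large as the population figures for the same data, and the gap shrinks as the number of rows grows.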

For examples of how to use the Summarize Data module in an experiment, see the sample experiments in the Model Gallery.

Technical Notes

  • For numeric and Boolean columns, you can output the mean, median, mode, and standard deviation.

  • For non-numeric columns, only the values for Count, Unique Value Count, and Missing Value Count are computed. For other statistics, a null value is returned.

  • Columns that contain Boolean values are processed as follows:

    • When calculating Min, a logical AND is applied.

    • When calculating Max, a logical OR is applied.

    • When computing Range, the module first checks whether the number of unique values in the column equals 2.

    • When computing any statistic that requires floating-point calculations, values of True are treated as 1.0, and values of False are treated as 0.0.
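The Boolean rules above can be sketched in a few lines of Python. The `summarize_boolean` function is a hypothetical helper written for this illustration, not part of the module; it covers only Min, Max, and Mean, and leaves out Range because the text does not say what happens after the unique-value check.

```python
def summarize_boolean(values):
    """Apply the Boolean rules to a non-empty list of True/False values."""
    # Floating-point statistics treat True as 1.0 and False as 0.0.
    as_floats = [1.0 if v else 0.0 for v in values]
    return {
        "Min": all(values),  # logical AND: True only if every value is True
        "Max": any(values),  # logical OR: True if at least one value is True
        "Mean": sum(as_floats) / len(as_floats),
    }

flags = [True, False, True, True]
stats = summarize_boolean(flags)
print(stats)
```

With this encoding, the mean of a Boolean column is simply the fraction of rows that are True.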

Expected input

Name      Type         Description
Dataset   Data Table   Input dataset

Output

Name              Type         Description
Results dataset   Data Table   A profile of the input dataset that contains descriptive statistics

For a complete list of error messages, see Module Error Codes.

Exception    Description
Error 0003   Exception occurs if one or more inputs are null or empty.
Error 0020   Exception occurs if the number of columns in some of the datasets passed to the module is too small.
Error 0021   Exception occurs if the number of rows in some of the datasets passed to the module is too small.

Statistical Functions
Compute Elementary Statistics
A-Z Module List
