SAMPLE Expression (U-SQL)
Updated: September 14, 2017
U-SQL supports samplers natively. The SAMPLE expression is needed to access some types of samplers.
Given an input rowset and arguments that are specific to the types of sampler being invoked, the sample expression outputs a rowset that is a statistical sample of the input. Moreover, the sample expression takes as input an optional identifier (column) that records the probability with which the corresponding row has been output.
All samplers execute in a single pass over the data and run in parallel on portions of the input; they do not require partitioning, shuffling, etc. They have a very small memory footprint, logarithmic in the sizes of the input and output.
| Syntax |
|---|
| Sample_Expression_Complex := 'SAMPLE' Rowset_Source Sampler_Details [Weight_Col] |
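For orientation, the three sampler forms correspond to invocations like the following sketch; @rowset is a hypothetical placeholder, and complete, runnable versions appear in the Examples section below.
// Each form samples @rowset; the WITH WEIGHT clause is optional and names the weight column.
@uniform  = SAMPLE @rowset UNIFORM (0.4) WITH WEIGHT AS _Weight;
@universe = SAMPLE @rowset ON Age UNIVERSE (0.1);
@distinct = SAMPLE @rowset ON ZipCode DISTINCT (0.1, 3) WITH WEIGHT AS _Weight;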
Semantics of Syntax Elements
UNIFORM
Rows are picked uniformly at random with probability equal to row_fraction. The weight column, if requested, is set to 1/(row_fraction) for all rows. The size of the output is governed by a binomial distribution; in expectation, the size of the output is row_fraction * SIZE(input rowset).
row_fraction
A double between 0 and 1 that indicates the probability with which a row in the input rowset will be passed.
Identifier_List
This option specifies the list of columns that define the groups. Syntax: Identifier_List := Quoted_or_Unquoted_Identifier { ',' Quoted_or_Unquoted_Identifier }.
UNIVERSE
All rows are passed whose group value (i.e., the value of the columns in Identifier_List) falls in some randomly chosen row_fraction of the space of group values. UNIVERSE uses a cryptographically strong hash function to pick a random portion of the values. The weight column, if requested, is set to 1/(row_fraction) for all rows. UNIVERSE ensures that sample-then-join is equivalent to join-then-sample: using UNIVERSE before an equijoin on both of its inputs, with Identifier_List containing exactly the equijoin columns, is identical to sampling after the join. The size of the output is also, in expectation, row_fraction * SIZE(input rowset). However, especially if there are few groups, the output size has more variance than with UNIFORM, since either all rows of a group are passed by the sampler or none are.
DISTINCT
Per group (a distinct value of the columns in Identifier_List), this sampler passes min_row_count rows; the remaining rows of the group are passed with probability row_fraction. The weight column, if requested, is set to 1/(row_fraction) if the row was passed in the probabilistic mode and to 1 otherwise. DISTINCT facilitates sample-before-group-by: using DISTINCT before a GROUP BY, with Identifier_List containing at least the grouping columns, guarantees that the sample will not miss any groups. DISTINCT may return fewer than min_row_count rows (but never fewer than 1) for some groups. A simple case is when a group has fewer than min_row_count rows in the input. A more complex case depends on the degree of parallelism of the stage that runs the sampler and on how the rows of the group are distributed among the input partitions.
Note: Ensure your sample probability, row_fraction, is a double between 0 and 1.
Examples
- The examples can be executed in Visual Studio with the Azure Data Lake Tools plug-in.
- The scripts can be executed locally. An Azure subscription and an Azure Data Lake Analytics account are not needed when executing locally.
- The examples below are based on the dataset defined below. Ensure your execution includes the rowset variable @data.
Dataset
@data =
SELECT * FROM
( VALUES
("02143", 30, 100, "Smith"),
("98052", 25, 50, "Andersen"),
("02139", 30, 75, "Todd"),
("02139", 25, 60, "Roberts"),
("61801", 23, 80, "Sanders")
) AS T(ZipCode, Age, Salary, LastName);
UNIFORM
@result =
SAMPLE @data UNIFORM (0.4) WITH WEIGHT AS _Weight;
OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/uniform.txt"
USING Outputters.Tsv();
Possible output below. Note that the output need not contain every ZipCode or all rows that have a given Age value. The likelihood of seeing this output is 0.03456.
| ZipCode | Age | Salary | LastName | Weight |
|---|---|---|---|---|
| 02143 | 30 | 100 | Smith | 2.5 |
| 98052 | 25 | 50 | Andersen | 2.5 |
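To see where these numbers come from: each of the five input rows is passed independently with probability 0.4, so the likelihood of obtaining exactly these two rows is 0.4^2 * 0.6^3 = 0.03456, and the weight of every passed row is 1/row_fraction = 1/0.4 = 2.5.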
UNIVERSE
Samples the @data rowset by picking a randomly chosen fraction of the overall value space of the ON columns and returning all rows whose value of the ON columns falls in the chosen space. The query implicitly picks 10% of all the values of the Age column (in expectation) and returns all rows whose Age value was chosen.
@result =
SAMPLE @data ON Age UNIVERSE (0.1);
OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/universe.txt"
USING Outputters.Tsv();
Possible output below. Note that all rows with Age=30 are picked if any row with Age=30 is picked. Similarly, all rows with Age=25 will either all be picked or none will be picked (as in this example).
The likelihood of seeing this output is 0.1 * 0.9^2 = 0.081: exactly the Age=30 value was chosen, and each of the three single-value outcomes has this same probability. These are the most likely outcomes after the empty output, which occurs with likelihood 0.9^3 = 0.729, since there are only three unique values for Age. Note: An empty result set may therefore be returned in this example due to the low sample probability and the small dataset.
| ZipCode | Age | Salary | LastName |
|---|---|---|---|
| 02143 | 30 | 100 | Smith |
| 02139 | 30 | 75 | Todd |
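The following is a minimal, hypothetical sketch of the sample-then-join property described in the UNIVERSE semantics above; the @bonus rowset, its columns, and the output path are invented for illustration. UNIVERSE is applied with ON containing exactly the equijoin column (Age) on both inputs before the join.
@bonus =
    SELECT * FROM
        ( VALUES
            (30, 10),
            (25, 5),
            (23, 8)
        ) AS B(Age, Bonus);
// UNIVERSE ON exactly the equijoin column (Age), applied to both join inputs.
@sampledData  = SAMPLE @data  ON Age UNIVERSE (0.1);
@sampledBonus = SAMPLE @bonus ON Age UNIVERSE (0.1);
@joined =
    SELECT d.ZipCode, d.Age, d.Salary, d.LastName, b.Bonus
    FROM @sampledData AS d
         INNER JOIN @sampledBonus AS b
         ON d.Age == b.Age;
OUTPUT @joined
TO "/ReferenceGuide/QSE/Sample/universe_join.txt"
USING Outputters.Tsv();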
DISTINCT
Samples the @data rowset such that at least 3 rows per distinct value of the ZipCode column are included, and additional rows are added with the indicated probability (10%). The weight of each passing row is added to the resulting rowset in the weight column _Weight.
@result =
SAMPLE @data ON ZipCode DISTINCT (0.1, 3) WITH WEIGHT AS _Weight;
OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/distinct.txt"
USING Outputters.Tsv();
Possible output below. Note that every ZipCode is represented in the output by up to 3 rows, if that many rows are available in the input. Because no ZipCode group in @data has more than min_row_count = 3 rows, every input row is passed deterministically with weight 1, so the likelihood of seeing this output is 1.
| ZipCode | Age | Salary | LastName | Weight |
|---|---|---|---|---|
| 02143 | 30 | 100 | Smith | 1 |
| 98052 | 25 | 50 | Andersen | 1 |
| 02139 | 30 | 75 | Todd | 1 |
| 02139 | 25 | 60 | Roberts | 1 |
| 61801 | 23 | 80 | Sanders | 1 |
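As a hypothetical follow-on sketch of the sample-before-group-by property described in the DISTINCT semantics above (the aggregate, the column name EstimatedSalaryTotal, and the output path are invented for illustration), the DISTINCT sample @result can feed a GROUP BY on ZipCode without losing any ZipCode group, and the _Weight column can rescale sums computed on the sample.
// No ZipCode group is missing from @result, and _Weight rescales rows passed probabilistically.
@estimate =
    SELECT ZipCode,
           SUM(Salary * _Weight) AS EstimatedSalaryTotal
    FROM @result
    GROUP BY ZipCode;
OUTPUT @estimate
TO "/ReferenceGuide/QSE/Sample/distinct_groupby.txt"
USING Outputters.Tsv();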