Table of contents
TOC
Collapse the table of content
Expand the table of content

SAMPLE (U-SQL)

Michael Rys|Last Updated: 5/15/2018
|
4 Contributors

U-SQL supports samplers natively. The SAMPLE clause enables convenient access from within a SELECT expression. All samplers execute in one pass on data and in parallel on portions of the input; they do not require partitioning, shuffle etc. Samplers have been implemented in a single pass over data and have very small memory footprint; log( SIZE(input), SIZE(output) ).

Syntax
Sample_Expression_Simple :=                                                                              
     Query_Expression 'SAMPLE' ( 'ANY' '(' row_count ')' | 'UNIFORM' '(' row_fraction ')' ).

Semantics of Syntax Elements

  • ANY
    Anyrow_count rows from the input rowset. ANY does not return a randomly chosen subset of rows; do not use where randomness is needed.

  • row_count
    A positive integer.

  • UNIFORM
    Rows are picked uniformly at random with probability equal to row_fraction. The size of the output is governed by a binomial distribution. In expectation, the size of output is row_fraction * SIZE(input rowset).

  • row_fraction
    A double between 0 and 1 that indicates the probability with which a row in the input rowset will be passed.

Tip

Ensure your sample probability, row_fraction, is adequate for the size of your dataset to minimize the possibility of an empty result set being returned.

Examples

  • The examples can be executed in Visual Studio with the Azure Data Lake Tools plug-in.
  • The scripts can be executed locally. An Azure subscription and Azure Data Lake Analytics account is not needed when executed locally.
  • The examples below are based on the dataset defined below. Ensure your execution includes the rowset variable, @data.

Dataset

@data = 
    SELECT * FROM 
        ( VALUES
        ("02143",   30,     100,  "Smith"),
        ("98052",   25,     50,   "Andersen"),
        ("02139",   30,     75,   "Todd"),
        ("02139",   25,     60,   "Roberts"),
        ("61801",   23,     80,   "Sanders")
        ) AS T(ZipCode, Age, Salary, LastName);

ANY
Samples the @data rowset by selecting the specified number of rows (2) in a completely arbitrary way.

@result = 
    SELECT * FROM @data SAMPLE ANY (2);

OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/any.txt"
USING Outputters.Tsv();

Possible output below.

ZipCodeAgeSalaryLastName
0214330100Smith
980522550Andersen

UNIFORM
Samples the @data rowset using a random uniform sampling with the provided probability (10%).

@result =
    SELECT * FROM @data SAMPLE UNIFORM (0.1);

OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/simple_uniform.txt"
USING Outputters.Tsv();

Possible output below. The likelihood of this output is 0.00729 (two rows are picked with a probability of 0.1 each and three rows not picked with a probability of 0.9 each). Note: An empty result set may be returned in this example due to the low probability and small dataset.

ZipCodeAgeSalaryLastName
0214330100Smith
980522550Andersen

See Also

© 2018 Microsoft