Table of contents
Collapse the table of content
Expand the table of content

U-SQL Scripts

Michael Rys|Last Updated: 5/15/2018
4 Contributors

The main unit of a U-SQL “program” is a U-SQL script. A script consists of a sequence of U-SQL statements.

Script :=                                                                                                

Semantics of Syntax Elements


The following shows an extended sample script that operates on one of the U-SQL sample data sets:

DECLARE @in  string = "/Samples/Data/SearchLog.tsv";  
DECLARE @out string = "/output/result.tsv";  

@searchlog =  
    EXTRACT UserId          int,  
            Start           DateTime,  
            Region          string,  
            Query           string,  
            Duration        int?,  
            Urls            string,  
            ClickedUrls     string  
    FROM @in  
    USING Extractors.Tsv();  

@rs1 =  
    SELECT Start, Region, Duration  
    FROM @searchlog  
    WHERE Region == "en-gb";  

@rs1 =  
    SELECT Start, Region, Duration  
    FROM @rs1  
    WHERE Start >= DateTime.Parse("2012/02/16") AND Start <= DateTime.Parse("2012/02/17");  

OUTPUT @rs1     
TO @out  
USING Outputters.Tsv();

The script starts with two DECLARE variable statements that declare the variables @in and @out that contain the input and output path names respectively. Then it assigns the variable @searchlog to the EXTRACT expression which then will be used in the next SELECT expression, which does a filter on the Region column. This SELECT expression will be assigned to the variable @rs1. The @rs1 variable will be further refined with the next SELECT expression that adds a range filter on the Start property before it is written to the output file using an OUTPUT statement.

Hopefully, you have stumbled over the wording “assigned the variable to the statement” in the paragraph above. This wording is calling out an important aspect of U-SQL scripts and the processing of a U-SQL script. While it seems that we assign the resulting rowset of each expression to a variable that we then process in the next step, in fact U-SQL is NOT calculating the result at this point in the script.

Instead each variable you see is actually a name for the expression that it is being assigned to. What this means is that U-SQL is not executing each statement sequentially, but composes bigger and bigger statements until there is a big expression tree that then gets optimized and executed as a whole. In language theory this is also known as lambda expression composition.

As a consequence, a single U-SQL script can contain several such expression trees that will be all executed in parallel without producing intermediate state. This gives the U-SQL execution the ability to optimize expressions across the whole script. For example, if none of the later expressions refer to a column that gets extracted in the beginning, the optimizer may decide to not actually extract the column, or a predicate that gets applied to a later rowset is being pushed to be executed much earlier to reduce the amount of data that gets processed in the intermediate steps.

However, it also means that one cannot perform intermediate investigations of state to determine different workflows inside the script. Such investigations either have to be written within the declarative framework or be delegated to an external workflow that will orchestrate individual scripts that produce intermediate snapshot results in files to help with the workflow orchestration.

The following images show the logical expression tree and the physical job graph of the above script as shown in the Azure DataLake Tool for VisualStudio.

Logical Expression Tree

U-SQL Logical Job Execution Tree

Physical Job Graph

U-SQL Physical Job Execution Tree

See Also

© 2018 Microsoft