U-SQL SELECT Selecting from Joins
Updated: March 10, 2017
U-SQL provides the following ways of joining two rowsets:
The syntax follows the ANSI SQL join syntax. The comma separated implicit cross join syntax with a join predicate in a WHERE (U-SQL) clause is not supported in U-SQL. Either use an explicit CROSS JOIN or preferably use an INNER or OUTER JOIN instead.
| Syntax |
|---|
Join_Expression :=
Rowset_Source Cross_Join_Operator Rowset_Source
| Rowset_Source Join_Operator Rowset_Source Join_On_Clause.
|
Semantics of Syntax Elements
Rowset_Source
Specifies the rowsets that are being joined. For more details on the rowset source see U-SQL SELECT FROM Clause.Cross_Join_Operator
The cross join operator joins the two rowsets by combining each row on the left side with each row on the right side.Syntax Cross_Join_Operator := 'CROSS' [Parallel_Join_Hint] 'JOIN'.It allows the optional specification of a join hint. See Join Hints for more details.
Join_Operator
Specifies the join operation.Syntax Join_Operator := [Join_Prefix] Join_Hints 'JOIN' | ['LEFT' | 'RIGHT'] Join_Hints Semijoin_Operator.
Join_Prefix := 'INNER' | [['LEFT' | 'RIGHT' | 'FULL'] 'OUTER'].
Semijoin_Operator := 'SEMIJOIN' | 'ANTISEMIJOIN'.The join operators can either be a variety of joins or semijoins.
[Join_Prefix]'JOIN'
Specifies the type of join.JOINdefaults toINNER JOIN,OUTER JOINdefaults toLEFT OUTER JOIN. See the sections on joins for more details on each of the join expressions.['LEFT' | 'RIGHT']Semijoin_Operator
Specifies the type of semijoin.SEMIJOINandANTISEMIJOINboth default toLEFT SEMIJOINandLEFT ANTISEMIJOINrespectively. See the sections on semijoins for more details on each of the semijoin expressions.Join_Hints
The optional specification of join hints. See Join Hints for more details.
It is highly recommended to always fully specify the join operator.
Join_On_Clause
Specifies the join comparison that is being used to identify which of the rows from each rowset will be joined.Syntax Join_On_Clause := 'ON' Join_Comparison_Expression.Join_Comparison_Expression
Provides the join comparison. See Join Comparisons for more details.
Join Comparisons
U-SQL, like most scaled out Big Data Query languages that support joins, restricts the join comparison to equality comparisons between columns in the rowsets to be joined. If the column references are not unique across all the rowsets in the query, they have to be fully qualified with the rowset variable name, rowset name or specified alias.
| Syntax |
|---|
Join_Comparison_Expression :=
Join_Equality {'AND' Join_Equality}. |
There are however ways to still apply other join predicates. If one has a non-equality comparison or a more complex expression (such as a method invocation) in the comparison, one can move the comparison to the SELECT’s WHERE clause. Or the more complex expression can be placed in an earlier SELECT statement’s column and then that alias can be referred to in the join comparison.
U-SQL decided to limit the ON clause to simple equality only and not perform the described rewrites to make it clear to the user that the more complex join predicates will negatively impact performance of the joins, since they will hinder the use of indices and other join optimizations. |
Examples
- The examples are designed for execution in Visual Studio with the Azure Data Lake Tools plug-in.
- The scripts can be executed locally. An Azure subscription and Azure Data Lake Analytics account is not needed when executed locally.
- The examples below are based on the dataset defined below. Ensure your execution includes the rowset variable.
@employees =
SELECT * FROM
( VALUES
(1, "Noah", 100, (int?)10000, new DateTime(2012,05,31)),
(2, "Sophia", 100, (int?)15000, new DateTime(2012,03,19)),
(3, "Liam", 100, (int?)30000, new DateTime(2014,09,14)),
(4, "Amy", 100, (int?)35000, new DateTime(1999,02,27)),
(5, "Justin", 100, (int?)15000, new DateTime(2015,01,12)),
(6, "Emma", 200, (int?)8000, new DateTime(2014,03,08)),
(7, "Jacob", 200, (int?)8000, new DateTime(2014,09,02)),
(8, "Olivia", 200, (int?)8000, new DateTime(2013,12,11)),
(9, "Mason", 300, (int?)50000, new DateTime(2016,01,01)),
(10, "Ava", 400, (int?)15000, new DateTime(2014,09,14)),
(11, "Ethan", 400, (int?)9000, new DateTime(2015,08,22)),
(12, "David", 800, (int?)100, new DateTime(2016,11,01)),
(13, "Andrew", 100, (int?)null, new DateTime(1995,07,16)),
(14, "Jennie", 100, (int?)34000, new DateTime(2000,02,12))
) AS T(EmpID, EmpName, DeptID, Salary, StartDate);
A. How to write U-SQL joins with equality and in-equality comparisons
The query first identifies a subset of data, DeptID and DeptMaxSalary, and stores it in a new rowset variable. Then both rowset variables are joined on DeptID to identify all employees that have a salary that is lower than the highest salary in his/her dept.
@employees2 =
SELECT DeptID,
MAX(Salary) AS DeptMaxSalary
FROM @employees
GROUP BY DeptID;
@result =
SELECT a.EmpName, a.DeptID, a.Salary, b.DeptMaxSalary
FROM @employees AS a
JOIN
@employees2 AS b
ON a.DeptID == b.DeptID
WHERE a.Salary < b.DeptMaxSalary;
OUTPUT @result
TO "/Output/ReferenceGuide/Joins/exampleA.csv"
USING Outputters.Csv();
B. How to write U-SQL joins with non-equijoins.
Returns each employee, his/her salary, and the number of employees with a higher salary
@result =
SELECT a.EmpName,
a.Salary,
COUNT( * ) - 1 AS NumberOfEmployeesWithHigherSalary
FROM @employees AS a
CROSS JOIN @employees AS b
WHERE a.Salary <= b.Salary
GROUP BY a.EmpName, a.Salary;
OUTPUT @result
TO "/Output/ReferenceGuide/Joins/exampleB.csv"
ORDER BY Salary DESC
USING Outputters.Csv();
C. How to write U-SQL joins where join comparison includes more complex expressions
All employees that have a salary that is equal to the adjusted highest salary in his/her dept.
@employees2 =
SELECT DeptID,
MAX(Salary)-1000 AS DeptAdjustedMaxSalary
FROM @employees
GROUP BY DeptID;
@result =
SELECT a.EmpName, a.DeptID, a.Salary, b.DeptAdjustedMaxSalary
FROM @employees AS a
JOIN
@employees2 AS b
ON a.DeptID == b.DeptID AND a.Salary == b.DeptAdjustedMaxSalary;
OUTPUT @result
TO "/Output/ReferenceGuide/Joins/exampleC.csv"
USING Outputters.Csv();
D. How to write U-SQL joins where join comparison includes more complex expressions, additional example.
This query identifies departments that have more than three employees. The expression is then joined back to the rowset variable to identify the employees from departments which have more than three employees.
@result =
SELECT a.DeptID,
a.EmpName
FROM @employees AS a
JOIN
(
SELECT DeptID
FROM @employees
GROUP BY DeptID
HAVING COUNT( * ) > 3
) AS b
ON a.DeptID == b.DeptID;
OUTPUT @result
TO "/Output/ReferenceGuide/Joins/exampleD.csv"
USING Outputters.Csv();