U-SQL is a data processing language that unifies the benefits of SQL with the expressive power of your own code. U-SQL’s scalable distributed query capability enables you to efficiently analyze data in file, object and relational stores such as Azure SQL Database.
Let’s assume that I have downloaded my Twitter history of all my tweets, retweets, and mentions as a CSV file and placed it into my Azure Data Lake Store. In this case I know the schema of the data I want to process, and for starters I want to just count the number of tweets for each of the authors in the tweet “network”:
    @t = EXTRACT date string
               , time string
               , author string
               , tweet string
         FROM "/input/MyTwitterHistory.csv"
         USING Extractors.Csv();

    @res = SELECT author
                , COUNT(*) AS tweetcount
           FROM @t
           GROUP BY author;

    OUTPUT @res TO "/output/MyTwitterAnalysis.csv"
    ORDER BY tweetcount DESC
    USING Outputters.Csv();
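The above U-SQL script shows the three major steps of processing data with U-SQL:
1. Extract data from a source, here a CSV file read with EXTRACT and the built-in Extractors.Csv(), with the schema declared in the query itself.
2. Transform the rowset with SQL, here a SELECT with a GROUP BY aggregation.
3. Output the result, here to a CSV file with OUTPUT and the built-in Outputters.Csv().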
Note that U-SQL’s SQL keywords must be upper-case to distinguish them syntactically from C# expressions that use the same keywords with a different meaning. Also notice that each expression is assigned to a variable (@t and @res). This lets U-SQL transform and combine data step by step, expressed as an incremental expression flow using functional lambda composition (similar to what you find in the Pig language). The execution framework then composes these expressions into a single expression, which can be globally optimized and scaled out in a way that isn’t possible if expressions are executed line by line.
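To make the composition idea concrete, here is a minimal sketch that reuses the input file from the script above. The null-author filter and the rowset name @withauthor are hypothetical and were chosen only for illustration. The named intermediate rowset is not run as a separate pass; logically, the two statements describe the same computation as the nested query shown in the comment.

```usql
@t =
    EXTRACT date string,
            time string,
            author string,
            tweet string
    FROM "/input/MyTwitterHistory.csv"
    USING Extractors.Csv();

// Step 1: name an intermediate rowset (hypothetical filter on empty authors).
@withauthor =
    SELECT author
    FROM @t
    WHERE !String.IsNullOrEmpty(author);

// Step 2: aggregate over the intermediate rowset.
@res =
    SELECT author,
           COUNT(*) AS tweetcount
    FROM @withauthor
    GROUP BY author;

// Logically, steps 1 and 2 compose into one nested expression:
//   SELECT author, COUNT(*) AS tweetcount
//   FROM (SELECT author FROM @t WHERE !String.IsNullOrEmpty(author)) AS a
//   GROUP BY author;

OUTPUT @res
TO "/output/MyTwitterAnalysis.csv"
ORDER BY tweetcount DESC
USING Outputters.Csv();
```

Writing the steps separately is mainly a readability choice; the optimizer works on the script as a whole.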
Going back to our example, I now want to add additional information about the people mentioned in the tweets and extend my aggregation to return how often people in my tweet network are authoring tweets and how often they are being mentioned. Because I can use C# to operate on the data, I can use an inline C# LINQ expression to extract the mentions into an ARRAY. Then I turn the array into a rowset with EXPLODE and apply the EXPLODE to each row’s array with a CROSS APPLY. I union the authors with the mentions, but need to drop the leading @-sign to align it with the author values. This is done with another C# expression where I take the Substring starting at position 1.
    @t = EXTRACT date string
               , time string
               , author string
               , tweet string
         FROM "/input/MyTwitterHistory.csv"
         USING Extractors.Csv();

    @m = SELECT new SQL.ARRAY<string>(
                    tweet.Split(' ').Where(x => x.StartsWith("@"))) AS refs
         FROM @t;

    @t = SELECT author, "authored" AS category
         FROM @t
         UNION ALL
         SELECT r.Substring(1) AS r, "mentioned" AS category
         FROM @m CROSS APPLY EXPLODE(refs) AS Refs(r);

    @res = SELECT author
                , category
                , COUNT(*) AS tweetcount
           FROM @t
           GROUP BY author, category;

    OUTPUT @res TO "/output/MyTwitterAnalysis.csv"
    ORDER BY tweetcount DESC
    USING Outputters.Csv();
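To see what the array, EXPLODE, and CROSS APPLY steps do in isolation, the following sketch runs them against two inline sample rows instead of the CSV file. The sample rows, the rowset name @mentions, and the output path are assumptions made up for illustration; the expressions themselves are the ones used in the script above.

```usql
// Hypothetical sample rows standing in for the CSV file.
@t =
    SELECT *
    FROM (VALUES
            ("alice", "lunch with @bob and @carol"),
            ("bob",   "thanks @alice")
         ) AS T(author, tweet);

// One array of "@name" tokens per tweet.
@m =
    SELECT new SQL.ARRAY<string>(
               tweet.Split(' ').Where(x => x.StartsWith("@"))) AS refs
    FROM @t;

// One row per array element, with the leading "@" removed.
@mentions =
    SELECT r.Substring(1) AS author
    FROM @m
         CROSS APPLY EXPLODE(refs) AS Refs(r);

OUTPUT @mentions
TO "/output/SampleMentions.csv"
USING Outputters.Csv();
```

With these sample rows, the output should contain three rows (bob, carol, and alice), in no guaranteed order, because there is no ORDER BY on the OUTPUT statement.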
If you analyze the characteristics of Big Data analytics, several requirements arise naturally for an easy to use, yet powerful language:
How do existing Big Data languages stack up to these requirements?
SQL-based languages (such as Hive and others) provide you with a declarative approach that natively does the scaling, parallel execution, and optimizations for you. This makes them easy to use, familiar to a wide range of developers, and powerful for many standard types of analytics and warehousing. However, their extensibility model and support for non-structured data and files are often bolted on and harder to use. For example, even if you just want to quickly explore your data in a file or remote data source, you need to create catalog objects to schematize file data or remote sources before you can query them, which reduces your agility. And although SQL-based languages often have several extensibility points for custom formatters, user-defined functions, and aggregators, they are rather complex to build, integrate, and maintain, with varying degrees of consistency in the programming models.
Programming language-based approaches to processing Big Data, for their part, provide an easy way to add your custom code. However, a programmer often has to explicitly code for scale and performance, often down to managing the execution topology and workflow, such as the synchronization between the different execution stages or the scale-out architecture. This code can be difficult to write correctly and to optimize for performance. Some frameworks support declarative components such as language-integrated queries or embedded SQL support. However, the SQL may be integrated as strings and thus lack tool support, and the extensibility integration may be limited or, because the procedural code does not guard against side effects, hard to optimize, and may not provide for reuse.
Taking the issues of both SQL-based and procedural languages into account, we designed U-SQL from the ground up as an evolution of the declarative SQL language with native extensibility through user code written in C#. This unifies both paradigms, unifies structured, unstructured, and remote data processing, unifies the declarative and custom imperative coding experience, and unifies the experience around extending your language capabilities.
U-SQL is built on the learnings from Microsoft’s internal experience with SCOPE and existing languages such as T-SQL, ANSI SQL, and Hive. For example, we base our SQL and programming language integration and the execution and optimization framework for U-SQL on SCOPE, which currently runs hundreds of thousands of jobs each day internally. We also align the metadata system (databases, tables, etc.), the SQL syntax, and language semantics with T-SQL and ANSI SQL, the query languages most of our SQL Server customers are familiar with. And we use C# data types and the C# expression language so you can seamlessly write C# predicates and expressions inside SELECT statements and use C# to add your custom logic. Finally, we looked to Hive and other Big Data languages to identify patterns and data processing requirements and integrate them into our framework.
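As a small illustration of C# predicates and expressions inside a SELECT statement, the sketch below reuses the Twitter history file from the earlier example. The search term, the rowset name @usqltweets, and the output path are hypothetical. The WHERE clause uses the C# `&&` operator so that the null check is evaluated before `Contains` is called.

```usql
@t =
    EXTRACT date string,
            time string,
            author string,
            tweet string
    FROM "/input/MyTwitterHistory.csv"
    USING Extractors.Csv();

// tweet.Length, String.IsNullOrEmpty, and tweet.Contains are ordinary C#
// members, written directly in the SELECT list and the WHERE clause.
@usqltweets =
    SELECT author,
           tweet.Length AS tweetlength
    FROM @t
    WHERE !String.IsNullOrEmpty(tweet) && tweet.Contains("U-SQL");

OUTPUT @usqltweets
TO "/output/UsqlTweets.csv"
USING Outputters.Csv();
```

Because the column types are C# types (`string` here), no conversion layer is needed between the SQL part of the statement and the C# expressions.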