This cheat sheet provides helpful tips and best practices for building dedicated SQL pool (formerly SQL DW) solutions.
Designing a data warehouse with dedicated SQL pool (formerly SQL DW) follows a general process, outlined in the sections below.
Queries and operations across tables
When you know in advance the primary operations and queries to be run in your data warehouse, you can prioritize your data warehouse architecture for those operations. These queries and operations might include:
- Joining one or two fact tables with dimension tables, filtering the combined table, and then appending the results into a data mart.
- Making large or small updates to your sales fact table.
- Appending only data to your tables.
Knowing the types of operations in advance helps you optimize the design of your tables.
First, load your data into Azure Data Lake Storage or Azure Blob Storage. Next, use the COPY statement to load your data into staging tables. Use the following configuration:
| Setting | Recommendation |
| --- | --- |
| Resource class | largerc or xlargerc |
Learn more about data migration, data loading, and the Extract, Load, and Transform (ELT) process.
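As a sketch, a COPY load from storage into a staging table might look like the following; the storage URL, credential, and table name are hypothetical placeholders:

```sql
-- Hypothetical example: account, container, and table names are placeholders.
COPY INTO dbo.Stage_Sales
FROM 'https://myaccount.blob.core.windows.net/staging/sales/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);
```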
Distributed or replicated tables
Use the following strategies, depending on the table properties:
| Type | Great fit for... | Watch out if... |
| --- | --- | --- |
| Replicated | Small dimension tables in a star schema with less than 2 GB of storage after compression (~5x compression) | Many write transactions are on the table (such as insert, upsert, delete, update)<br>You change Data Warehouse Units (DWU) provisioning frequently<br>You only use 2-3 columns but your table has many columns<br>You index a replicated table |
| Round Robin (default) | Temporary/staging table<br>No obvious joining key or good candidate column | Performance is slow due to data movement |
| Hash | Fact tables<br>Large dimension tables | The distribution key cannot be updated |
- Start with Round Robin, but aspire to a hash distribution strategy to take advantage of a massively parallel architecture.
- Make sure that common hash keys have the same data format.
- Don't use a varchar column as the distribution key.
- Dimension tables that are frequently joined to a fact table on a common hash key can be hash distributed.
- Use sys.dm_pdw_nodes_db_partition_stats to analyze any skewness in the data.
- Use sys.dm_pdw_request_steps to analyze the data movement behind queries and to monitor how long broadcast and shuffle operations take. This is helpful when reviewing your distribution strategy.
Learn more about replicated tables and distributed tables.
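The distribution choices above are declared when you create a table. A minimal sketch, with hypothetical table and column names:

```sql
-- Hash-distribute a fact table on a join key that is rarely updated.
CREATE TABLE dbo.FactSales
(
    SaleId     BIGINT NOT NULL,
    CustomerId INT    NOT NULL,
    Amount     DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
);

-- Replicate a small dimension table (< 2 GB after compression).
CREATE TABLE dbo.DimCustomer
(
    CustomerId INT NOT NULL,
    Name       NVARCHAR(100)
)
WITH
(
    DISTRIBUTION = REPLICATE
);
```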
Index your table
Indexing helps you read tables quickly. Choose an index type based on your needs:
| Type | Great fit for... | Watch out if... |
| --- | --- | --- |
| Heap | Staging/temporary table<br>Small tables with small lookups | Any lookup scans the full table |
| Clustered index | Tables with up to 100 million rows<br>Large tables (more than 100 million rows) with only 1-2 columns heavily used | Used on a replicated table<br>You have complex queries involving multiple join and GROUP BY operations<br>You make updates on the indexed columns: it takes memory |
| Clustered columnstore index (CCI) (default) | Large tables (more than 100 million rows) | Used on a replicated table<br>You make massive update operations on your table<br>You overpartition your table: row groups do not span across different distribution nodes and partitions |
- On top of a clustered index, you might want to add a nonclustered index to a column heavily used for filtering.
- Be careful how you manage the memory on a table with CCI. When you load data, you want the user (or the query) to benefit from a large resource class. Make sure to avoid trimming and creating many small compressed row groups.
- On Gen2, CCI tables are cached locally on the compute nodes to maximize performance.
- For CCI, slow performance can happen due to poor compression of your row groups. If this occurs, rebuild or reorganize your CCI. You want at least 100,000 rows per compressed row group. The ideal is 1 million rows in a row group.
- Based on the incremental load frequency and size, you want to automate when you reorganize or rebuild your indexes. Spring cleaning is always helpful.
- Be strategic when you want to trim a row group. How large are the open row groups? How much data do you expect to load in the coming days?
Learn more about indexes.
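To see whether CCI row groups are undersized, you can query the columnstore row-group DMV and rebuild when needed. A sketch, with a hypothetical table name:

```sql
-- Inspect row group counts and sizes per table and state
-- (OPEN, CLOSED, COMPRESSED).
SELECT t.name              AS table_name,
       rg.state_desc,
       COUNT(*)            AS row_group_count,
       AVG(rg.total_rows)  AS avg_rows_per_group
FROM sys.dm_pdw_nodes_db_column_store_row_group_physical_stats AS rg
JOIN sys.tables AS t ON rg.object_id = t.object_id
GROUP BY t.name, rg.state_desc;

-- Rebuild if many compressed row groups are far below ~1 million rows.
ALTER INDEX ALL ON dbo.FactSales REBUILD;
```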
Partition your table

You might partition your table when you have a large fact table (greater than 1 billion rows). In 99 percent of cases, the partition key should be based on date. Staging tables that require ELT can also benefit from partitioning, which facilitates data lifecycle management. Be careful not to overpartition your data, especially on a clustered columnstore index.
Learn more about partitions.
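A date-partitioned fact table can be sketched like this; table, column, and boundary values are hypothetical:

```sql
-- Partition on the date column; RANGE RIGHT puts each boundary value
-- in the partition to its right.
CREATE TABLE dbo.FactSales
(
    SaleId   BIGINT NOT NULL,
    SaleDate DATE   NOT NULL,
    Amount   DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(SaleId),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (SaleDate RANGE RIGHT FOR VALUES
        ('2023-01-01', '2024-01-01', '2025-01-01'))
);
```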
Incrementally load data

If you're going to incrementally load your data, first make sure that you allocate larger resource classes to loading your data. This is particularly important when loading into tables with clustered columnstore indexes. See resource classes for further details.
We recommend using PolyBase and ADF V2 for automating your ELT pipelines into your data warehouse.
For a large batch of updates in your historical data, consider using a CTAS to write the data you want to keep in a table rather than using INSERT, UPDATE, and DELETE.
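The CTAS pattern can be sketched as follows: write the rows you want to keep into a new table, then swap names. Table names and the filter are hypothetical:

```sql
-- Rewrite the table with only the rows to keep, instead of a large DELETE.
CREATE TABLE dbo.FactSales_New
WITH
(
    DISTRIBUTION = HASH(SaleId),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM dbo.FactSales
WHERE SaleDate >= '2020-01-01';

-- Swap the new table into place.
RENAME OBJECT dbo.FactSales     TO FactSales_Old;
RENAME OBJECT dbo.FactSales_New TO FactSales;
```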
Maintain statistics

It's important to update statistics as significant changes happen to your data. See update statistics to determine if significant changes have occurred. Updated statistics optimize your query plans. If you find that it takes too long to maintain all of your statistics, be more selective about which columns have statistics.
You can also define the frequency of the updates. For example, you might want to update date columns, where new values might be added, on a daily basis. You gain the most benefit by having statistics on columns involved in joins, columns used in the WHERE clause, and columns found in GROUP BY.
Learn more about statistics.
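Creating and refreshing statistics on a join or filter column might look like this; the table and column names are hypothetical:

```sql
-- Create single-column statistics on a column used in joins/filters.
CREATE STATISTICS stat_FactSales_SaleDate
    ON dbo.FactSales (SaleDate);

-- Refresh after significant data changes.
UPDATE STATISTICS dbo.FactSales;
```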
Resource classes

Resource classes are used as a way to allocate memory to queries. If you need more memory to improve query or loading speed, you should allocate higher resource classes. On the flip side, using larger resource classes impacts concurrency. You want to take that into consideration before moving all of your users to a large resource class.
If you notice that queries take too long, check that your users do not run in large resource classes. Large resource classes consume many concurrency slots. They can cause other queries to queue up.
Finally, by using Gen2 of dedicated SQL pool (formerly SQL DW), each resource class gets 2.5 times more memory than Gen1.
Learn more how to work with resource classes and concurrency.
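Resource classes are assigned through database roles. For example, a loading user (the user name here is hypothetical) might temporarily be granted a larger class:

```sql
-- Grant a larger resource class for a heavy load...
EXEC sp_addrolemember 'largerc', 'LoadUser';

-- ...and revert afterwards to free up concurrency slots.
EXEC sp_droprolemember 'largerc', 'LoadUser';
```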
Lower your cost
A key feature of Azure Synapse is the ability to manage compute resources. You can pause your dedicated SQL pool (formerly SQL DW) when you're not using it, which stops the billing of compute resources. You can scale resources to meet your performance demands. To pause, use the Azure portal or PowerShell. To scale, use the Azure portal, PowerShell, T-SQL, or a REST API.
You can also automate scaling on a schedule with Azure Functions.
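Scaling can also be done in T-SQL. A sketch, assuming a hypothetical pool named MySqlPool (run against the master database):

```sql
-- Scale the dedicated SQL pool to a different service objective.
ALTER DATABASE MySqlPool
MODIFY (SERVICE_OBJECTIVE = 'DW1000c');
```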
Optimize your architecture for performance
We recommend considering SQL Database and Azure Analysis Services in a hub-and-spoke architecture. This solution can provide workload isolation between different user groups while also using advanced security features from SQL Database and Azure Analysis Services. This is also a way to provide limitless concurrency to your users.
Learn more about typical architectures that take advantage of dedicated SQL pool (formerly SQL DW) in Azure Synapse Analytics.
You can deploy your spokes as SQL databases from dedicated SQL pool (formerly SQL DW) with a one-click template.
Download this 2-page SQL Basics Cheat Sheet in PDF or PNG format, print it out, and stick it to your desk.

The SQL Basics Cheat Sheet provides you with the syntax of all basic clauses, shows you how to write different conditions, and includes examples. You can also read its contents here:
SQL Basics Cheat Sheet
SQL, or Structured Query Language, is a language to talk to databases. It allows you to select specific data and to build complex reports. Today, SQL is a universal language of data. It is used in practically all technologies that process data.
QUERYING SINGLE TABLE
Fetch all columns from the city table:

Fetch id and name columns from the city table:

Fetch city names sorted by the rating column in the default ASCending order:

Fetch city names sorted by the rating column in the DESCending order:
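The four queries above, in order, assuming the cheat sheet's example city table with id, name, and rating columns:

```sql
SELECT * FROM city;

SELECT id, name FROM city;

SELECT name FROM city ORDER BY rating;       -- ASC is the default

SELECT name FROM city ORDER BY rating DESC;
```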
FILTERING THE OUTPUT
COMPARISON OPERATORS

Fetch names of cities that have a rating above 3:

Fetch names of cities that are neither Berlin nor Madrid:
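The two queries above, again using the example city table:

```sql
SELECT name FROM city
WHERE rating > 3;

SELECT name FROM city
WHERE name != 'Berlin' AND name != 'Madrid';
```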
TEXT OPERATORS

Fetch names of cities that start with a 'P' or end with an 's':

Fetch names of cities that start with any letter followed by 'ublin' (like Dublin in Ireland or Lublin in Poland):
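In LIKE patterns, % matches any number of characters and _ matches exactly one:

```sql
SELECT name FROM city
WHERE name LIKE 'P%' OR name LIKE '%s';

SELECT name FROM city
WHERE name LIKE '_ublin';
```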
OTHER OPERATORS

Fetch names of cities that have a population between 500K and 5M:

Fetch names of cities that don't miss a rating value:

Fetch names of cities that are in countries with IDs 1, 4, 7, or 8:
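The three queries above; the population and country_id columns on the example city table are assumed from the prompts:

```sql
SELECT name FROM city
WHERE population BETWEEN 500000 AND 5000000;

SELECT name FROM city
WHERE rating IS NOT NULL;

SELECT name FROM city
WHERE country_id IN (1, 4, 7, 8);
```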
QUERYING MULTIPLE TABLES
JOIN (or explicitly INNER JOIN) returns rows that have matching values in both tables.

LEFT JOIN returns all rows from the left table with corresponding rows from the right table. If there's no matching row, NULLs are returned as values from the second table.

RIGHT JOIN returns all rows from the right table with corresponding rows from the left table. If there's no matching row, NULLs are returned as values from the left table.

FULL JOIN (or explicitly FULL OUTER JOIN) returns all rows from both tables – if there's no matching row in the second table, NULLs are returned.

CROSS JOIN returns all possible combinations of rows from both tables. There are two syntaxes available.

NATURAL JOIN joins tables by all columns with the same name. It is very rarely used in practice.
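Sketches of the join syntax, assuming city and country tables where city.country_id references country.id (as in the other examples in this cheat sheet):

```sql
SELECT city.name, country.name
FROM city
JOIN country ON city.country_id = country.id;

SELECT city.name, country.name
FROM city
LEFT JOIN country ON city.country_id = country.id;

-- CROSS JOIN: two equivalent syntaxes
SELECT city.name, country.name
FROM city CROSS JOIN country;

SELECT city.name, country.name
FROM city, country;

-- NATURAL JOIN matches on every pair of same-named columns
SELECT *
FROM city NATURAL JOIN country;
```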
AGGREGATION AND GROUPING
GROUP BY groups together rows that have the same values in specified columns. It computes summaries (aggregates) for each unique combination of values.
avg(expr)− average value for rows within the group
count(expr)− count of values for rows within the group
max(expr)− maximum value within the group
min(expr)− minimum value within the group
sum(expr)− sum of values within the group
Find out the number of cities:
Find out the number of cities with non-null ratings:
Find out the number of distinct country values:
Find out the smallest and the greatest country populations:
Find out the total population of cities in respective countries:
Find out the average rating for cities in respective countries if the average is above 3.0:
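The six prompts above, in order, assuming the example city and country tables:

```sql
SELECT COUNT(*) FROM city;

SELECT COUNT(rating) FROM city;

SELECT COUNT(DISTINCT country_id) FROM city;

SELECT MIN(population), MAX(population) FROM country;

SELECT country_id, SUM(population)
FROM city
GROUP BY country_id;

SELECT country_id, AVG(rating)
FROM city
GROUP BY country_id
HAVING AVG(rating) > 3.0;
```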
SUBQUERIES

A subquery is a query that is nested inside another query, or inside another subquery. There are different types of subqueries.
The simplest subquery returns exactly one column and exactly one row. It can be used with comparison operators such as =, !=, <, or >.

This query finds cities with the same rating as Paris:
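Using the example city table:

```sql
SELECT name
FROM city
WHERE rating = (
    SELECT rating
    FROM city
    WHERE name = 'Paris'
);
```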
A subquery can also return multiple columns or multiple rows. Such subqueries can be used with operators such as IN, EXISTS, ALL, or ANY.

This query finds cities in countries that have a population above 20M:
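Using the example city and country tables:

```sql
SELECT name
FROM city
WHERE country_id IN (
    SELECT id
    FROM country
    WHERE population > 20000000
);
```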
A correlated subquery refers to the tables introduced in the outer query. A correlated subquery depends on the outer query. It cannot be run independently from the outer query.
This query finds cities with a population greater than the average population in their country:

This query finds countries that have at least one city:
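The two correlated queries above, using the example city and country tables (the aliases are illustrative):

```sql
SELECT *
FROM city AS main_city
WHERE population > (
    SELECT AVG(population)
    FROM city AS average_city
    WHERE average_city.country_id = main_city.country_id
);

SELECT name
FROM country
WHERE EXISTS (
    SELECT *
    FROM city
    WHERE city.country_id = country.id
);
```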
SET OPERATIONS

Set operations are used to combine the results of two or more queries into a single result. The combined queries must return the same number of columns and compatible data types. The names of the corresponding columns can be different.
UNION combines the results of two result sets and removes duplicates.
UNION ALL doesn't remove duplicate rows.
This query displays German cyclists together with German skaters:
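Assuming hypothetical cycling and skating tables, each with name and country columns:

```sql
SELECT name
FROM cycling
WHERE country = 'DE'
UNION          -- or UNION ALL to keep duplicates
SELECT name
FROM skating
WHERE country = 'DE';
```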
INTERSECT returns only rows that appear in both result sets.
This query displays German cyclists who are also German skaters at the same time:
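With the same hypothetical cycling and skating tables:

```sql
SELECT name
FROM cycling
WHERE country = 'DE'
INTERSECT
SELECT name
FROM skating
WHERE country = 'DE';
```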
EXCEPT returns only the rows that appear in the first result set but do not appear in the second result set.
This query displays German cyclists unless they are also German skaters at the same time:
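Again with the hypothetical cycling and skating tables:

```sql
SELECT name
FROM cycling
WHERE country = 'DE'
EXCEPT
SELECT name
FROM skating
WHERE country = 'DE';
```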