I will begin with a brief overview of SQL, then cover the five major topics a data scientist should understand when working with relational databases: basic statistics in SQL, data preparation in SQL, advanced filtering and data aggregation, window functions, and preparing data for use with analytics tools.
3. Jean Joseph
Data Engineer/DBA
Blog : bigdatadriven.org
Email: jean.joseph@bigdatadriven.org
Twitter: @garella79/@cloudatadriven
LinkedIn: https://www.linkedin.com/in/jeandjoseph/
In IT: over 18 years
From: New Jersey
Originally from: Haiti
4. Overview
• Brief intro to SQL
• The five major things to know in RDBMS
• Data preparation in SQL
• SQL advanced filtering
• Preparing data for use with analytics tools
• Key takeaways
5. What is RDBMS?
• Relational Database Management System
• Tabular
• Row(s), Column(s)
• Objects (Tables, Views, Synonyms, Functions, Procedures, ...)
• Normalization (OLTP)
• De-Normalization (OLAP)
• ACID
6. What Is SQL?
• Stands for Structured Query Language
• SQL lets you control, create, and modify object(s) and manipulate data
What Can We Do With SQL?
DDL: CREATE, DROP, TRUNCATE, ALTER, COMMENT, RENAME
DQL: SELECT
DML: INSERT, UPDATE, DELETE
DCL: GRANT, REVOKE
TCL: COMMIT, ROLLBACK, SAVEPOINT, SET TRANSACTION
Why SQL?
CRUD
Data Scientist Should Master
7. Types of SQL Joins
APPLY (Transact-SQL)
• CROSS
• OUTER
PIVOT (Transact-SQL)
UNION [ALL]
EXCEPT
INTERSECT
Join Data Set
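As a minimal sketch of the set operators listed above, the snippet below runs UNION, UNION ALL, EXCEPT, and INTERSECT, plus a plain inner join for comparison. It uses Python's built-in sqlite3 module only so that it runs anywhere; the tables `a` and `b` are hypothetical and not from the talk. APPLY and PIVOT are Transact-SQL features that SQLite does not have, so they are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (2), (3), (4);
""")

def ids(sql):
    """Run a query and return its first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

union_all = ids("SELECT id FROM a UNION ALL SELECT id FROM b")  # keeps duplicates
union = ids("SELECT id FROM a UNION SELECT id FROM b")          # removes duplicates
except_ = ids("SELECT id FROM a EXCEPT SELECT id FROM b")       # in a but not in b
intersect = ids("SELECT id FROM a INTERSECT SELECT id FROM b")  # in both a and b
inner = ids("SELECT a.id FROM a JOIN b ON a.id = b.id")         # matching rows only
```

The set operators stack rows from two queries with the same column list, while a join combines columns from matching rows.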
9. SQL Functions To Prep Data
Categories: Position, Character Set, Transformation, Soundex
• CHARINDEX
• PATINDEX
• LEN
• STUFF
• STRING_AGG
• SUBSTRING
• STRING_SPLIT
• STRING_ESCAPE
• TRANSLATE
• CONCAT_WS
• CONCAT
• LEFT
• RIGHT
• LOWER
• UPPER
• TRIM
• REPLACE
• REVERSE
• REPLICATE
• ASCII
• CHAR
• NCHAR
• UNICODE
• DIFFERENCE
• SOUNDEX
10. AGGREGATE vs WINDOWING FUNCTIONS
AGGREGATE FUNCTIONS:
• Operate on an entire data set or table, or on each group defined by a GROUP BY clause, and return a single row per group.
WINDOWING FUNCTIONS:
• Do not collapse rows into a single output row. The rows retain their separate identities, and the aggregated value is added to each row.
Types of window functions:
• Aggregate Window Functions
• SUM(), MAX(), MIN(), AVG(), COUNT()
• Ranking Window Functions
• RANK(), DENSE_RANK(), ROW_NUMBER(), NTILE()
• Value Window Functions
• LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE()
WINDOW_FUNCTION ( [ ALL ] expression )
OVER ( [ PARTITION BY partition_list ]
[ ORDER BY order_list] )
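The difference between the two families is easiest to see on the same data. The sketch below assumes a hypothetical `sales` table (not from the talk) and runs through Python's built-in sqlite3 module; the SUM ... GROUP BY and SUM ... OVER (PARTITION BY ...) forms used here are written the same way in T-SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('East', 10), ('East', 30), ('West', 5);
""")

# Aggregate function with GROUP BY: one output row per region.
grouped = conn.execute("""
    SELECT region, SUM(amount)
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()

# Window function: every input row is kept, with the regional total attached.
windowed = conn.execute("""
    SELECT region, amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales
    ORDER BY region, amount
""").fetchall()
```

Three input rows become two rows in `grouped` but stay three rows in `windowed`.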
12. Data Prep Position Function - CHARINDEX
CHARINDEX ( expressionToFind , expressionToSearch [ , start_location ] )
String: This is a great event
SUBSTRING(string, start, length)
Parameter Description
• string: Required. The string to extract from.
• start: Required. The start position. The first position in string is 1.
• length: Required. The number of characters to extract. Must be a positive number.
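To make the position and extraction steps concrete, here is a small sketch on the slide's sample string. SQLite has no CHARINDEX or SUBSTRING, so its `instr` and `substr` functions are swapped in as stand-ins; this is an illustration of the idea, not T-SQL itself. The search word "great" is chosen here for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
text = "This is a great event"

# instr(haystack, needle) returns the 1-based position of the first match,
# or 0 when there is none. The argument order is the reverse of CHARINDEX.
position = conn.execute("SELECT instr(?, ?)", (text, "great")).fetchone()[0]

# substr(string, start, length) mirrors SUBSTRING(string, start, length).
word = conn.execute(
    "SELECT substr(?, ?, ?)", (text, position, 5)
).fetchone()[0]
```

In T-SQL the same two steps would be written as CHARINDEX('great', @text) followed by SUBSTRING(@text, position, 5), where `@text` is a hypothetical variable.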
13. Data Prep Position Function - PATINDEX
PATINDEX ( '%pattern%' , expression )
%pattern% Required. The pattern to find. It must be surrounded by %, except when matching at the very start or end of the expression.
• % - Match any string of any length (including 0 length)
• _ - Match one single character
• [] - Match any single character in the brackets, e.g. [xyz]
• [^] - Match any single character not in the brackets, e.g. [^xyz]
expression Required. The string to be searched.
Example: find any string that contains 'big' and ends with 'driven.org'.
14. Data Prep Position Function - STRING_AGG
STRING_AGG ( input_string, separator ) [ order_clause ]
• input_string is any type that can be converted to VARCHAR or NVARCHAR for concatenation.
• separator is the separator for the result string. It can be a literal or a variable.
• order_clause specifies the sort order of the concatenated results using the WITHIN GROUP clause:
WITHIN GROUP ( ORDER BY expression [ ASC | DESC ] )
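The sketch below shows the same concatenate-per-group idea. SQLite has no STRING_AGG, so `group_concat(value, separator)` is swapped in as a stand-in; the `tags` table is hypothetical. In this form SQLite does not guarantee the order of values inside each result, so the check sorts them in Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (post_id INTEGER, tag TEXT);
    INSERT INTO tags VALUES (1, 'sql'), (1, 'window'), (2, 'etl');
""")

# group_concat plays the role of STRING_AGG: one concatenated string per group.
rows = conn.execute("""
    SELECT post_id, group_concat(tag, ', ')
    FROM tags
    GROUP BY post_id
    ORDER BY post_id
""").fetchall()

# Split and sort in Python because the in-group order is not guaranteed here.
tags_by_post = {post_id: sorted(joined.split(", ")) for post_id, joined in rows}
```

That missing ordering guarantee is what WITHIN GROUP ( ORDER BY ... ) provides in T-SQL.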
15. Data Prep Position Function - STRING_SPLIT
STRING_SPLIT(string, separator)
16. SQL Aggregate & Analytical Functions
Analytic
• CUME_DIST()
• FIRST_VALUE()
• LAG()
• LAST_VALUE()
• LEAD()
• PERCENT_RANK()
• PERCENTILE_CONT()
• PERCENTILE_DISC()
Aggregate
• APPROX_COUNT_DISTINCT()
• AVG()
• CHECKSUM_AGG()
• COUNT()
• COUNT_BIG()
• GROUPING()
• GROUPING_ID()
• MAX()
• MIN()
• STDEV()
• STDEVP()
• SUM()
• VAR()
• VARP()
Ranking
• ROW_NUMBER()
• RANK()
• DENSE_RANK()
• NTILE()
Windowing Function - Framing
• ROWS / RANGE
• ROWS can use an in-memory worktable
• RANGE uses an on-disk worktable in tempdb
• Keywords
• PRECEDING
• FOLLOWING
• UNBOUNDED
• CURRENT ROW
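Beyond where the work is done, ROWS and RANGE frames can also return different values when the ORDER BY column has ties. The sketch below assumes a hypothetical `payments` table and runs through Python's built-in sqlite3 module; the frame keywords are the same ones T-SQL uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (id INTEGER, day INTEGER, amount INTEGER);
    INSERT INTO payments VALUES (1, 1, 10), (2, 1, 20), (3, 2, 30);
""")

rows = conn.execute("""
    SELECT id,
           -- ROWS: the frame grows one physical row at a time.
           SUM(amount) OVER (ORDER BY day, id
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
           -- RANGE: rows with the same day are peers and share one frame.
           SUM(amount) OVER (ORDER BY day
                             RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    FROM payments
    ORDER BY id
""").fetchall()
```

Rows 1 and 2 fall on the same day, so the RANGE total for both already includes both amounts, while the ROWS total accumulates row by row.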
18. Business Request:
Provide the total sales due, the average sales order amount, the number of sales orders, and the sales rank for each year, including all fees (tax, shipping, ...).
Task:
• Get all sales orders for each year
• Calculate:
• Sum of Total Due by year
• Average of sales orders by year
• Number of sales orders by year
• How well products are selling relative to other years
Hint: Each year --> GROUP BY (AGGREGATE FUNCTIONS)
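One possible shape for this request is sketched below, assuming a cut-down `SalesOrderHeader` table with made-up rows and a `TotalDue` column that already includes tax and shipping. It runs through Python's built-in sqlite3 module, so `strftime('%Y', ...)` stands in for T-SQL's YEAR(); the CTE aggregates per year first and the ranking is applied to the aggregated rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SalesOrderHeader (
        SalesOrderID INTEGER, CustomerID INTEGER, OrderDate TEXT, TotalDue REAL);
    INSERT INTO SalesOrderHeader VALUES
        (1, 10, '2013-03-01', 100.0),
        (2, 11, '2013-07-15', 300.0),
        (3, 10, '2014-02-20', 250.0);
""")

yearly = conn.execute("""
    WITH by_year AS (
        -- One row per year: sum, average, and count of orders.
        SELECT strftime('%Y', OrderDate) AS order_year,
               SUM(TotalDue) AS total_due,
               AVG(TotalDue) AS avg_due,
               COUNT(*)      AS order_count
        FROM SalesOrderHeader
        GROUP BY strftime('%Y', OrderDate)
    )
    -- Rank the years against each other by total sales.
    SELECT order_year, total_due, avg_due, order_count,
           RANK() OVER (ORDER BY total_due DESC) AS sales_rank
    FROM by_year
    ORDER BY order_year
""").fetchall()
```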
19. Business Request: RUNNING TOTAL
• Provide the Daily Running Total Due on Sales Orders and include CustomerID, SalesOrderID, OrderDate for the period of 2014-06-01 onward.
• Order by SalesOrderID, OrderDate
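A running total is a windowed SUM with a frame that ends at the current row. The sketch below assumes a cut-down `SalesOrderHeader` table with made-up rows and runs through Python's built-in sqlite3 module; the OVER clause is written the same way in T-SQL. It reads the request as one continuous running total across the whole period, which is an interpretation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SalesOrderHeader (
        SalesOrderID INTEGER, CustomerID INTEGER, OrderDate TEXT, TotalDue REAL);
    INSERT INTO SalesOrderHeader VALUES
        (1, 10, '2014-05-31', 999.0),
        (2, 10, '2014-06-01', 100.0),
        (3, 11, '2014-06-01',  50.0),
        (4, 10, '2014-06-02',  25.0);
""")

running = conn.execute("""
    SELECT CustomerID, SalesOrderID, OrderDate, TotalDue,
           SUM(TotalDue) OVER (
               ORDER BY SalesOrderID, OrderDate
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM SalesOrderHeader
    WHERE OrderDate >= '2014-06-01'   -- rows before the period are excluded
    ORDER BY SalesOrderID, OrderDate
""").fetchall()

running_totals = [row[-1] for row in running]
```

If the total should restart every day instead, adding PARTITION BY OrderDate to the OVER clause would do that.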
20. Business Request:
The Marketing Team has asked you to return the first three (3) orders, plus the close price, Total Order Due, and Total Orders, for every customer that purchased from us more than 15 times.
Tasks:
• Find all order details per customer
• Return only the first 3 orders per customer with:
• Close Price
• Total Orders
• where Total Order Count is greater than 15
21. Business Request:
A business user has asked you to return all orders from the SalesOrderHeader table for any customer who had over $10,000 in purchases for their first three transactions.
Task:
• Find the first three orders per customer
• Aggregate the first three orders
• Keep the customers whose total is over $10,000
• Return all orders for those customers
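One way to approach this request is sketched below: number each customer's orders with ROW_NUMBER(), total the first three, and use that result to filter the full table. The table is a cut-down `SalesOrderHeader` with made-up rows, run through Python's built-in sqlite3 module; the CTE names are hypothetical, and the SQL itself should carry over to T-SQL largely unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SalesOrderHeader (
        SalesOrderID INTEGER, CustomerID INTEGER, OrderDate TEXT, TotalDue REAL);
    INSERT INTO SalesOrderHeader VALUES
        (1, 10, '2014-01-01', 6000.0),
        (2, 10, '2014-01-02', 3000.0),
        (3, 10, '2014-01-03', 2000.0),
        (4, 10, '2014-01-04',   10.0),
        (5, 11, '2014-01-01', 4000.0),
        (6, 11, '2014-01-02', 4000.0),
        (7, 11, '2014-01-03', 1000.0),
        (8, 11, '2014-01-04', 9000.0);
""")

orders = conn.execute("""
    WITH numbered AS (
        -- Number each customer's orders from oldest to newest.
        SELECT CustomerID, TotalDue,
               ROW_NUMBER() OVER (PARTITION BY CustomerID
                                  ORDER BY OrderDate, SalesOrderID) AS rn
        FROM SalesOrderHeader
    ),
    big_starters AS (
        -- Total only the first three orders and keep totals over 10,000.
        SELECT CustomerID
        FROM numbered
        WHERE rn <= 3
        GROUP BY CustomerID
        HAVING SUM(TotalDue) > 10000
    )
    -- Return every order, not just the first three, for those customers.
    SELECT SalesOrderID, CustomerID, TotalDue
    FROM SalesOrderHeader
    WHERE CustomerID IN (SELECT CustomerID FROM big_starters)
    ORDER BY SalesOrderID
""").fetchall()
```

Customer 11 is excluded even though a later order is large, because only the first three transactions count toward the threshold.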
22. Business Request:
Find the first and last close price for each day of every month of the year, including every detail row.
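Keeping every detail row while attaching a per-day first and last value is a natural fit for FIRST_VALUE() and LAST_VALUE(). The sketch below assumes a hypothetical `Quotes` table, since the talk's source table is not shown, and runs through Python's built-in sqlite3 module; both functions and the frame syntax also exist in T-SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Quotes (TradeDate TEXT, TradeTime TEXT, ClosePrice REAL);
    INSERT INTO Quotes VALUES
        ('2014-06-02', '09:30', 10.0),
        ('2014-06-02', '16:00', 12.0),
        ('2014-06-03', '09:30', 11.0),
        ('2014-06-03', '12:00', 13.0),
        ('2014-06-03', '16:00',  9.0);
""")

quotes = conn.execute("""
    SELECT TradeDate, TradeTime, ClosePrice,
           FIRST_VALUE(ClosePrice) OVER (
               PARTITION BY TradeDate ORDER BY TradeTime
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS first_close,
           -- The explicit frame matters here: with the default frame,
           -- LAST_VALUE would stop at the current row.
           LAST_VALUE(ClosePrice) OVER (
               PARTITION BY TradeDate ORDER BY TradeTime
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS last_close
    FROM Quotes
    ORDER BY TradeDate, TradeTime
""").fetchall()

first_last = [(row[3], row[4]) for row in quotes]
```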
23. Business Request:
The business wants the output displayed in a different format.
Calculate the total due for each month, and get the subtotal for each year by region and territory.
26. The Five Major Things To Know In RDBMS
• CRUD
• ACID
• TCL
• Query Optimizer
• When To Use Index
Bonus
• Exception
29. Jean Joseph
Thank You So Much For Your Participation!