Oracle provides several analytical functions that allow for powerful data analysis using SQL. These include group functions that aggregate data over groups or windows, as well as window functions like ROW_NUMBER, RANK, and LAG that analyze data relative to the current row. ROLLUP and CUBE extensions to the GROUP BY clause enable calculation of subtotals across multiple dimensions of data with a single query.
2. Subclassing of the aggregate functions
Category       Usage
Group          These functions deal with the grouping operation or the group aggregate function.
Unary Group    These functions take an arbitrary <value expression> as an argument.
Binary Group   These functions take a pair of arguments, a dependent one and an independent one,
               both of which are numeric expressions. They remove NULL values from the group and,
               if there are no remaining rows, they evaluate to 0.
Window         These functions compute their aggregate values the same way as a group function,
               except that they aggregate over the window frame of a row and not over a group
               of a grouped table.
3. Data Analysis with SQL
Today, information management systems along with operational applications need to support a wide variety
of business requirements that typically involve some degree of analytical processing. These requirements
can range from data enrichment and transformation during ETL workflows, creating time-based calculations
like moving averages and moving totals for sales reports, and performing real-time pattern searches within
log files, to building what-if data models during budgeting and planning exercises. Developers, business users
and project teams can choose from a wide range of languages to create solutions to meet these
requirements.
Over time many companies have found that the use of so many different programming languages to drive their
data systems creates five key problems:
1. Decreases the ability to rapidly innovate
2. Creates data silos
3. Results in application-level performance bottlenecks that are hard to trace and rectify
4. Drives up costs by complicating the deployment and management processes
5. Increases the level of investment in training
Development teams need to quickly deliver new and innovative applications that provide significant
competitive advantage and drive additional revenue streams. Anything that stifles innovation needs to be
urgently reviewed and resolved. The challenge facing many organizations is to find the right platform and
language to securely and efficiently manage the data and analytical requirements while at the same time
supporting the broadest range of tools and applications to maximize the investment in existing skills.
4. Concepts behind analytical SQL
Oracle’s in-database analytical SQL – first introduced in Oracle Database 8i Release 1 – has
introduced several new elements to SQL processing. These elements build on existing SQL
features to provide developers and business users with a framework that is both flexible and
powerful in terms of its ability to support sophisticated calculations. There are four essential
concepts used in the processing of Oracle’s analytic SQL:
❖ Processing order
❖ Result set partitions
❖ Calculation windows
❖ Current Row
This four-step process is internally optimized and completely transparent, but it does provide
a high degree of flexibility in terms of being able to layer analytical features to create the
desired result set without having to resort to long and complicated SQL statements. The
following sections will explore these four concepts in more detail.
5. Processing order
Query processing using analytic SQL takes place in three stages:
Stage 1: All joins, WHERE, GROUP BY and HAVING clauses are performed.
Where customers are using Exadata storage servers the initial join and
filtering operations for the query will be managed by the storage cells. This
step can result in a significant reduction in the volume of data passed to the
analytic function, which helps improve performance.
Stage 2: The result set is made available to the analytic function, and all the
calculations are applied.
Stage 3: If the query has an ORDER BY clause then this is processed to allow
for precise control of the final output.
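The three stages can be traced in a single statement. The following is a minimal sketch only: the inline `sales_demo` data set and its column names are hypothetical and are not part of the original material.

```sql
-- Hypothetical inline data; sales_demo and its columns are assumptions for illustration.
WITH sales_demo AS (
  SELECT 'Q1' AS qtr, 'Web' AS channel, 100 AS revenue FROM dual UNION ALL
  SELECT 'Q1', 'Store', 200 FROM dual UNION ALL
  SELECT 'Q2', 'Web',   150 FROM dual UNION ALL
  SELECT 'Q2', 'Store',  50 FROM dual UNION ALL
  SELECT 'Q2', 'Phone',  10 FROM dual
)
SELECT qtr,
       SUM(revenue)              AS qtr_revenue,    -- Stage 1: GROUP BY aggregate
       SUM(SUM(revenue)) OVER () AS total_revenue   -- Stage 2: analytic function over the grouped rows
FROM   sales_demo
WHERE  channel <> 'Phone'                           -- Stage 1: rows are filtered before any analytic work
GROUP  BY qtr
ORDER  BY qtr;                                      -- Stage 3: final ordering of the output
```

Because the WHERE clause runs in stage 1, the 'Phone' row never reaches the analytic function, so total_revenue is 500 on both output rows. For the same reason an analytic result cannot be filtered in the WHERE clause of the query block that computes it; the usual approach is to wrap the query in a subquery and filter in the outer query.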
6. Partitions
Partitions – organizing your data sets
Analytic SQL allows users to divide query result sets into ordered groups of rows
called “partitions”. Any aggregated results such as SUMs, AVGs etc. are available
to the analytical functions. Partitions can be based upon any column(s) or
expression. A query result set may have just one partition holding all the rows, a
few large partitions, or many small partitions with each holding just a few rows.
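As a rough illustration of partitioning on an expression rather than a plain column, the sketch below groups rows by calendar month. The `orders_demo` data and its column names are hypothetical.

```sql
-- Hypothetical inline data; orders_demo and its columns are assumptions for illustration.
WITH orders_demo AS (
  SELECT DATE '2024-01-05' AS order_date, 100 AS amount FROM dual UNION ALL
  SELECT DATE '2024-01-20', 300 FROM dual UNION ALL
  SELECT DATE '2024-02-03', 200 FROM dual
)
SELECT order_date,
       amount,
       -- One partition per calendar month, derived from an expression
       SUM(amount) OVER (PARTITION BY TRUNC(order_date, 'MM')) AS month_total,
       -- Empty OVER (): a single partition holding every row
       SUM(amount) OVER ()                                     AS grand_total
FROM   orders_demo
ORDER  BY order_date;
```

Each detail row is kept in the output; the partition only decides which rows contribute to the value computed for it.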
7. Calculation windows
Within each partition, a sliding window of data can be defined. The window determines the
range of rows used to perform the calculations for the "current row" (defined in the next
section). Window sizes can be based on either a physical number of rows or a logical interval
such as time.
The window has a starting row and an ending row. Depending on its definition, the window
may move at one or both ends.
For instance, a window defined for a cumulative sum function would have its starting row
fixed at the first row of its partition, and its ending row would slide from the starting point all
the way to the last row of the partition.
SELECT
Qtrs
, Months
, Channels
, Revenue
, SUM(Revenue) OVER (PARTITION BY Qtrs) AS Qtr_Sales
, SUM(Revenue) OVER () AS Total_Sales
FROM sales_table;
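The difference between a window with one fixed end, a window that moves at both ends, and a logical (value-based) window can be sketched as follows. The `monthly_demo` data and column names are hypothetical.

```sql
-- Hypothetical inline data; monthly_demo and its columns are assumptions for illustration.
WITH monthly_demo AS (
  SELECT 1 AS month_no, 10 AS revenue FROM dual UNION ALL
  SELECT 2, 20 FROM dual UNION ALL
  SELECT 3, 30 FROM dual UNION ALL
  SELECT 4, 40 FROM dual
)
SELECT month_no,
       revenue,
       -- Start fixed at the first row, end slides with the current row (cumulative sum)
       SUM(revenue) OVER (ORDER BY month_no
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_revenue,
       -- Both ends move: physical window of one row either side of the current row
       AVG(revenue) OVER (ORDER BY month_no
                          ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)         AS centered_avg,
       -- Logical window: rows whose month_no lies within 1 of the current row's month_no
       SUM(revenue) OVER (ORDER BY month_no
                          RANGE BETWEEN 1 PRECEDING AND CURRENT ROW)        AS last_two_months
FROM   monthly_demo
ORDER  BY month_no;
```

ROWS counts physical rows, while RANGE compares values of the ORDER BY expression, so the two only agree when the ordering values are evenly spaced and unique, as they are in this small data set.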
8. Current Row
Each calculation performed with an analytic function is based on a current row within
a window. The current row serves as the reference point determining the start and end
of the window. For instance, a running total over a three-row window would be the
result of the current row plus the values from the preceding two rows. At the end of
the partition the running total is reset. The example shown below creates running
totals within a result set showing the total sales for each channel within a product
category within a year:
SELECT calendar_year
, prod_category_desc
, channel_desc
, country_name
, sales
, units
, SUM(sales) OVER (PARTITION BY calendar_year, prod_category_desc,
channel_desc ORDER BY country_name) sales_tot_cat_by_channel
FROM . . .
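The idea of a total built from the current row plus the two preceding rows, restarting for each partition, can be written with an explicit ROWS clause. This is a sketch under assumptions: the `channel_sales_demo` data and its column names are hypothetical.

```sql
-- Hypothetical inline data; channel_sales_demo and its columns are assumptions for illustration.
WITH channel_sales_demo AS (
  SELECT 'Web' AS channel, 1 AS month_no, 10 AS sales FROM dual UNION ALL
  SELECT 'Web',   2, 20 FROM dual UNION ALL
  SELECT 'Web',   3, 30 FROM dual UNION ALL
  SELECT 'Web',   4, 40 FROM dual UNION ALL
  SELECT 'Store', 1,  5 FROM dual UNION ALL
  SELECT 'Store', 2, 15 FROM dual
)
SELECT channel,
       month_no,
       sales,
       -- Window = current row plus the two rows before it, within the same channel
       SUM(sales) OVER (PARTITION BY channel
                        ORDER BY month_no
                        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS sales_3_row_total
FROM   channel_sales_demo
ORDER  BY channel, month_no;
```

For the 'Web' rows the totals are 10, 30, 60 and 90; the window never reaches back into another channel, so the 'Store' rows start again at 5 and 20.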
9. FIRST_VALUE
FIRST_VALUE is an analytic function that, as the name suggests, returns the value of the
first row in an ordered set of rows.
Example:
In this example we look up the lowest age in each city in the emp table. Ordering
each partition by age makes the first row the one with the lowest age.
select empid, age,
doj,
FIRST_VALUE(age)
OVER(PARTITION BY city
ORDER BY age
) lowest_age
from emp;
10. LAST_VALUE
It is also an analytic function, used to get the value of the last row in an ordered set of rows.
Example:
In this example we get the highest age in each city in the employee table. Ordering each
partition by age makes the last row the one with the highest age.
select empid,
age, city,
LAST_VALUE(age)
OVER(PARTITION BY city
ORDER BY age
RANGE BETWEEN UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING) HIGHEST_AGE
from employee;
The clause ‘RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING’ means that the
window frame starts at the first row and ends at the last row of each partition.
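The explicit frame matters because, when an ORDER BY is present and no frame is given, the default window ends at the current row. The sketch below contrasts the two; the `emp_demo` data and column names are hypothetical.

```sql
-- Hypothetical inline data; emp_demo and its columns are assumptions for illustration.
WITH emp_demo AS (
  SELECT 1 AS empid, 'Delhi' AS city, 25 AS age FROM dual UNION ALL
  SELECT 2, 'Delhi', 40 FROM dual UNION ALL
  SELECT 3, 'Delhi', 31 FROM dual
)
SELECT empid,
       age,
       -- Default frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):
       -- the "last" row of the frame is the current row itself
       LAST_VALUE(age) OVER (PARTITION BY city ORDER BY age) AS last_default_frame,
       -- Explicit full frame: the last row of the whole partition
       LAST_VALUE(age) OVER (PARTITION BY city ORDER BY age
                             RANGE BETWEEN UNBOUNDED PRECEDING
                                       AND UNBOUNDED FOLLOWING) AS last_full_frame
FROM   emp_demo
ORDER  BY age;
```

With the default frame each row simply sees its own age (25, 31, 40), while the full frame returns 40 for every row.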
11. LEAD
LEAD allows us to access a following row from the current row, based on an offset
value, without using a self-join.
Example:
In this example we get the age of the next employee, in order of age, among the
employees from the city of Delhi.
SELECT
city,
age,
LEAD(age) OVER ( ORDER BY age ) following_employee_age
FROM
employee
WHERE
city = 'Delhi';
12. LAG
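LAG allows you to access the row at a given offset prior to the current row without
using a self-join.
Example:
In this example we get the age of the previous employee, in order of age, among the
employees from the city of Delhi.
SELECT
city,
age,
LAG(age) OVER (
ORDER BY age
) previous_employee_age
FROM
employee
WHERE
city = 'Delhi';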
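Both LAG and LEAD accept an optional offset and an optional default value that is returned when the offset falls outside the partition. The following sketch shows these arguments; the `emp_demo` data and column names are hypothetical.

```sql
-- Hypothetical inline data; emp_demo and its columns are assumptions for illustration.
WITH emp_demo AS (
  SELECT 1 AS empid, 25 AS age FROM dual UNION ALL
  SELECT 2, 40 FROM dual UNION ALL
  SELECT 3, 31 FROM dual UNION ALL
  SELECT 4, 52 FROM dual
)
SELECT empid,
       age,
       LAG(age)        OVER (ORDER BY age) AS prev_age,       -- offset defaults to 1; NULL on the first row
       LAG(age, 1, 0)  OVER (ORDER BY age) AS prev_age_or_0,  -- third argument replaces the NULL
       LEAD(age, 2)    OVER (ORDER BY age) AS age_two_ahead,  -- value from two rows after the current row
       age - LAG(age)  OVER (ORDER BY age) AS gap_to_prev     -- row-to-row difference without a self-join
FROM   emp_demo
ORDER  BY age;
```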
13. Nth Value
As the name suggests, it returns the Nth value among a set of values.
Example:
In this example we find the second-highest salary in each department.
SELECT
deptid,
salary,
NTH_VALUE(salary,2) OVER (PARTITION BY deptid ORDER BY salary DESC
RANGE BETWEEN
UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING
) AS second_salary
FROM
emp;
14. ROW_NUMBER
This function assigns a unique sequential number to each row of the result set.
Example:
SELECT
ROW_NUMBER() OVER(
ORDER BY salary DESC
) row_num,
empid,
empname,
doj
FROM
emp;
15. RANK
RANK calculates the rank of a value in an ordered set of values. One important
point that distinguishes it from DENSE_RANK is that the ranks from this function
may not be consecutive numbers: rows that tie share a rank, and the ranks that
follow are skipped.
Example:
In this example we find the rank of each employee based on salary in
descending order.
SELECT empid, empname, salary, RANK() OVER(ORDER BY salary desc)
RANK_NUMBER from emp;
16. PERCENT_RANK
PERCENT_RANK calculates the percent rank of a value among an ordered set of values.
Example:
In this example we calculate the percent rank of the salary of each empid in
table emp.
SELECT
empid,
empname,
salary,
ROUND(PERCENT_RANK() OVER (ORDER BY salary DESC) * 100,2) || '%' percent_rank
FROM
emp;
17. DENSE_RANK
DENSE_RANK is an analytic function that calculates the rank of a row. Unlike the RANK
function, it returns ranks as consecutive integers, with no gaps after ties.
Example:
In this example we rank each row of the EMP table by its deptid.
SELECT deptid,
DENSE_RANK() OVER ( ORDER BY deptid)
dept_rank
FROM
EMP;
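The difference between ROW_NUMBER, RANK and DENSE_RANK only shows up when rows tie. A small side-by-side sketch, using hypothetical `emp_demo` data with two equal salaries:

```sql
-- Hypothetical inline data; emp_demo and its columns are assumptions for illustration.
WITH emp_demo AS (
  SELECT 1 AS empid, 5000 AS salary FROM dual UNION ALL
  SELECT 2, 4000 FROM dual UNION ALL
  SELECT 3, 4000 FROM dual UNION ALL
  SELECT 4, 3000 FROM dual
)
SELECT empid,
       salary,
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,     -- always unique; order among ties is arbitrary
       RANK()       OVER (ORDER BY salary DESC) AS rank_num,    -- ties share a rank, next rank is skipped
       DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank_num -- ties share a rank, no gap afterwards
FROM   emp_demo
ORDER  BY salary DESC, empid;
```

For the salaries 5000, 4000, 4000, 3000 RANK yields 1, 2, 2, 4 and DENSE_RANK yields 1, 2, 2, 3. ROW_NUMBER yields 1 to 4, but which of the two tied rows receives 2 and which receives 3 is not guaranteed unless a tie-breaker is added to the ORDER BY.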
18. CUME_DIST
CUME_DIST calculates the cumulative distribution of a value among a set of values.
Example:
In this example we get the salary percentile for each employee.
SELECT
empid,
salary,
ROUND(cume_dist() OVER (ORDER BY salary DESC) * 100,2) || '%' cumulative_dist
FROM
emp;
19. LISTAGG
LISTAGG is an aggregate function that transforms data from multiple rows into a single
list of values separated by a given delimiter.
❖ Used as an aggregate, it operates on all rows of a group and returns a single row.
❖ It returns a comma- or other delimiter-separated list, much like a row of an Excel
CSV file.
❖ It returns a string value.
❖ As it is an aggregate function, any non-aggregate column used in the
SELECT statement must also be listed in the GROUP BY clause.
❖ It can also be used as an analytic function, in which case the OVER ( ) clause
partitions the result set into groups.
❖ The result returned by this function is limited to 4000 bytes.
20. LISTAGG - Continue
Syntax:
LISTAGG(Column [, Delimiter]) WITHIN GROUP (ORDER BY Col_name)
[OVER (PARTITION BY Col_name)]
Description:
COLUMN: The value on which the LISTAGG function operates. It can be a column,
constant, or expression.
Delimiter: It is used to separate the values in the result row. It is a string type. It is optional
and by default it is NULL.
WITHIN GROUP: It is a mandatory clause.
ORDER BY Col_name: The ORDER BY clause sorts the data according to the given
Col_name. Col_name can be a column name or an expression. By default,
the order is ascending.
OVER (PARTITION BY Col_name): This is an optional clause. It is used with the LISTAGG function
for grouping the result set based on Col_name. Col_name can be a column name or
expression(s) in the PARTITION BY clause. With this clause, LISTAGG is used as an analytic function.
21. LISTAGG – Continue - Examples
LISTAGG() Function without GROUP BY Clause
SELECT LISTAGG(Name, '') WITHIN GROUP (ORDER BY Name) Agg_Name FROM Employee;
LISTAGG() Function with GROUP BY Clause
SELECT Deptnumber, LISTAGG(Name, '') WITHIN GROUP (ORDER BY Name) Agg_Name
FROM Employee GROUP BY Deptnumber;
The GROUP BY clause is used for grouping the data. If an aggregate function is
used in the SELECT statement together with a non-aggregate column, then the GROUP BY
clause must include that non-aggregate column to group the data accordingly.
LISTAGG() Function with PARTITION BY Clause
SELECT Deptnumber, Designation, LISTAGG(Name, '|') WITHIN GROUP (ORDER BY
Designation, Name) OVER (PARTITION BY Deptnumber) Agg_Name FROM Employee ORDER
BY Deptnumber, Designation, Name;
22. ORACLE ROLLUP, CUBE AND GROUPING FUNCTIONS
CUBE, ROLLUP, and Top-N Queries
The last decade has seen a tremendous increase in the use of query, reporting, and on-line analytical
processing (OLAP) tools, often in conjunction with data warehouses and data marts. Enterprises exploring new
markets and facing greater competition expect these tools to provide the maximum possible decision-making
value from their data resources.
Oracle expands its long-standing support for analytical applications in Oracle8i release 8.1.5 with the CUBE and
ROLLUP extensions to SQL. Oracle also provides optimized performance and simplified syntax for Top-N queries.
These enhancements make important calculations significantly easier and more efficient, enhancing database
performance, scalability and simplicity.
ROLLUP and CUBE are simple extensions to the SELECT statement's GROUP BY clause. ROLLUP creates subtotals
at any level of aggregation needed, from the most detailed up to a grand total. CUBE is an extension similar to
ROLLUP, enabling a single statement to calculate all possible combinations of subtotals. CUBE can generate the
information needed in cross-tab reports with a single query. To enhance performance, both CUBE and ROLLUP
are parallelized: multiple processes can simultaneously execute both types of statements.
Enhanced Top-N queries enable more efficient retrieval of the largest and smallest values of a data set. This
chapter presents concepts, syntax, and examples of CUBE, ROLLUP and Top-N analysis.
23. ROLLUP
ROLLUP enables a SELECT statement to calculate multiple levels of subtotals across a specified
group of dimensions. It also calculates a grand total. ROLLUP is a simple extension to the GROUP BY
clause, so its syntax is extremely easy to use. The ROLLUP extension is highly efficient, adding
minimal overhead to a query.
Syntax
ROLLUP appears in the GROUP BY clause in a SELECT statement. Its form is:
SELECT ... GROUP BY
ROLLUP(grouping_column_reference_list)
Details
ROLLUP's action is straightforward: it creates subtotals which "roll up" from the most detailed level
to a grand total, following a grouping list specified in the ROLLUP clause. ROLLUP takes as its
argument an ordered list of grouping columns. First, it calculates the standard aggregate values
specified in the GROUP BY clause. Then, it creates progressively higher-level subtotals, moving from
right to left through the list of grouping columns. Finally, it creates a grand total.
ROLLUP will create subtotals at n+1 levels, where n is the number of grouping columns. For
instance, if a query specifies ROLLUP on grouping columns of Time, Region, and Department (n=3),
the result set will include rows at four aggregation levels.
24. ROLLUP – Cont.,
Example
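This example of ROLLUP uses the emp table.
select deptid,salary,sum(salary) salary_sum
from emp
group by rollup(deptid,salary)
order by deptid;
For a three-column ROLLUP(Time, Region, Department), as described earlier, the result set contains:
❖ Regular aggregation rows that would be produced by GROUP BY without using ROLLUP
❖ First-level subtotals aggregating across Department for each combination of Time and Region
❖ Second-level subtotals aggregating across Region and Department for each Time value
❖ A grand total row
The two-column query above follows the same pattern with one level fewer: rows for each
(deptid, salary) combination, a subtotal for each deptid, and a grand total.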
When to Use ROLLUP
Use the ROLLUP extension in tasks involving subtotals.
It is very helpful for subtotaling along a hierarchical dimension such as time or geography. For instance, a query
could specify a ROLLUP of year/month/day or country/state/city.
It simplifies and speeds the population and maintenance of summary tables. Data warehouse administrators
may want to make extensive use of it. Note that population of summary tables is even faster if the ROLLUP
query executes in parallel.
25. CUBE
CUBE appears in the GROUP BY clause in a SELECT statement. Its form is:
SELECT ... GROUP BY
CUBE (grouping_column_reference_list)
Details
CUBE takes a specified set of grouping columns and creates subtotals for all
possible combinations of them. In terms of multi-dimensional analysis, CUBE
generates all the subtotals that could be calculated for a data cube with the
specified dimensions. If you have specified CUBE(Time, Region, Department),
the result set will include all the values that would be included in an
equivalent ROLLUP statement plus additional combinations. For instance, in
Table 20-1, the departmental totals across regions (279,000 and 319,000)
would not be calculated by a ROLLUP(Time, Region, Department) clause, but
they would be calculated by a CUBE(Time, Region, Department) clause. If
there are n columns specified for a CUBE, there will be 2^n combinations of
subtotals returned. Table 20-3 gives an example of a three-dimension CUBE.
26. CUBE Example
select deptid,gender,age,sum(salary) from emp
group by cube(deptid,gender,age) order by deptid;
When to Use CUBE
Use CUBE in any situation requiring cross-tabular reports. The data needed for cross-
tabular reports can be generated with a single SELECT using CUBE. Like ROLLUP, CUBE
can be helpful in generating summary tables. Note that population of summary tables is
even faster if the CUBE query executes in parallel.
CUBE is especially valuable in queries that use columns from multiple dimensions rather
than columns representing different levels of a single dimension. For instance, a
commonly requested cross-tabulation might need subtotals for all the combinations of
month/state/product. These are three independent dimensions, and analysis of all
possible subtotal combinations will be commonplace. In contrast, a cross-tabulation
showing all possible combinations of year/month/day would have several values of
limited interest, since there is a natural hierarchy in the time dimension. Subtotals such
as profit by day of month summed across year would be unnecessary in most analyses.
27. GROUPING
Two challenges arise with the use of ROLLUP and CUBE. First, how can we programmatically
determine which result set rows are subtotals, and how do we find the exact level of
aggregation of a given subtotal? We will often need to use subtotals in calculations such as
percent-of-totals, so we need an easy way to determine which rows are the subtotals we
seek. Second, what happens if query results contain both stored NULL values and "NULL"
values created by a ROLLUP or CUBE? How does an application or developer differentiate
between the two?
To handle these issues, Oracle 8i introduces a new function called GROUPING. Using a single
column as its argument, GROUPING returns 1 when it encounters a NULL value created by a
ROLLUP or CUBE operation. That is, if the NULL indicates the row is a subtotal, GROUPING
returns a 1. Any other type of value, including a stored NULL, will return a 0.
Syntax
GROUPING appears in the selection list portion of a SELECT statement. Its form is:
SELECT ... [GROUPING(dimension_column)...] ...
GROUP BY ... {CUBE | ROLLUP}
28. GROUPING - Example
select gender, age, sum(salary) sumSal,
grouping(gender),
grouping(age)
from emp
group by rollup(gender,age);
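One common use of GROUPING is to label subtotal rows while keeping stored NULLs distinguishable. The sketch below is illustrative only: the `emp_demo` data, the label strings and the column aliases are hypothetical.

```sql
-- Hypothetical inline data; emp_demo, the labels and the aliases are assumptions for illustration.
WITH emp_demo AS (
  SELECT 'F' AS gender, 30 AS age, 1000 AS salary FROM dual UNION ALL
  SELECT 'M', 30, 2000 FROM dual UNION ALL
  SELECT CAST(NULL AS VARCHAR2(1)), 40, 500 FROM dual   -- a stored NULL gender
)
SELECT CASE
         WHEN GROUPING(gender) = 1 THEN 'ALL GENDERS'   -- NULL created by ROLLUP
         ELSE NVL(gender, 'UNKNOWN')                    -- NULL stored in the data
       END                AS gender_label,
       CASE
         WHEN GROUPING(age) = 1 THEN 'ALL AGES'
         ELSE TO_CHAR(age)
       END                AS age_label,
       SUM(salary)        AS sum_sal,
       GROUPING(gender)   AS g_gender,
       GROUPING(age)      AS g_age
FROM   emp_demo
GROUP  BY ROLLUP(gender, age)
ORDER  BY g_gender, gender, g_age, age;
```

The row with the stored NULL gender is labelled 'UNKNOWN' because GROUPING(gender) is 0 there, while the grand total row, where both GROUPING values are 1, is labelled 'ALL GENDERS' / 'ALL AGES' and sums to 3500.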