Project Report on
Developing statistical models of demand
forecasting for domestic trade market
operations of Torrent Pharmaceuticals Ltd.
In partial fulfillment of requirements of
Master of Business Administration (2006-08)
Submitted by
SUCHIT SHAH
Roll no. 51
MBA-I
Submitted to
AES PGIBM
Undertaken at
Torrent Pharmaceuticals Limited
ACKNOWLEDGEMENT
I hereby take the opportunity to express my gratitude to Mr. K G Ramchandran,
General Manager Human Resources, for allowing me to undertake my summer
training at a well-reputed organization like Torrent Pharmaceuticals Limited,
and to Ms. Miti Randeri and Ms. Mallika Priyadarshini, Assistant Manager HR,
for taking care of all our official requirements.
I express my sincere thanks and gratitude to Mr. Vipul Patel, General
Manager Supply Chain Management, and Mr. Chandan Chatterjee, AGM Supply
Chain Management, for guiding and encouraging me to carry out my project
work successfully.
I wish to convey my deepest regards and thanks to my project guides,
Mr. Bhavesh Nainani, Manager DPC, and Mr. Deep Vyas, Manager DPC, for all
their help, timely guidance and feedback in spite of a very busy schedule.
They always managed to find time to sit with me and provide the necessary
guidelines and ideas.
I also wish to express my sincere thanks to Mr. Bhavin Shah, Assistant
Manager DPC, and Mr. Jayant Nikhare, Assistant Manager DPC, for their help
and support in every possible way.
Finally, I wish to thank all the staff of the Supply Chain Management
Department for their kind cooperation during the tenure of my project.
Suchit Shah
Summer Trainee
May-July (2007)
AES Post Graduate Institute of Business Management
Ahmedabad
TABLE OF CONTENTS
Executive Summary
Torrent Group: Overview
Mission, Vision & Values
Objective
Salient Features
Project Constraints
Assumptions made during the project
Project Overview
Benefits Expected
Demand Forecasting: Introduction
The basic steps in a forecasting task
Company network
Demand Planning @SCM Dept
Exponential Smoothing
Triple Exponential Smoothing
Multiple Regression with ‘n’ factors
Multiple Regression with MS Excel
LINEST function
Fitting Multiple Regression Model
Methodology
Regional level forecasting
Findings
Recommendations
Future scope of the model
Limitations
References
Executive Summary
This report first studies how the planning system works at the Domestic Demand
Planning Cell (DPC), SCM dept., Torrent Pharmaceuticals Ltd., with a view to getting
acquainted with its systems and processes.
It examines the various reports prepared by the DPC, what data are maintained and in
what form, and the existing demand forecasting procedure. An exploratory analysis is
carried out.
In the initial phase, the report studies the product basket. Products are classified
primarily by their sales behavior, and the report attempts to identify and define all the
factors which may directly or indirectly affect the sales of an SKU.
With the product basket and its behavior understood, the report defines the problem.
Demand forecasting is the process of determining what products are needed where,
when, and in what quantities. Sales need to be forecast in such a way that the
forecast deviates less from actual sales. The report explores the topics related to
demand forecasting.
The present forecasting system is well structured and well defined, yet it has not
produced accurate forecasts because it cannot quantify the fluctuations in actual sales.
Hence the need to develop a statistical model, and the report goes on to explore the
alternative models available.
Using 24 months of sales data, a triple exponential smoothing model is applied under
its assumptions. It does not give the desired output and is therefore rejected.
Actual sales are affected by many parameters, all of which should be accounted for so
that their effect on actual sales is captured. In view of these complexities, the report
applies a multiple regression model with the short-listed parameters. It takes all the
assumptions of multiple regression for granted. In the model, sales is the dependent
variable and the affecting parameters are the independent variables.
A database is built to hold the data for the included parameters. Data are collected from
SAP and ORG-Marg. Primary sales and institutional sales are obtained from internal
data sources; tertiary sales are obtained from ORG data.
Using MS Excel (2003), a multiple regression is fitted to the data and forecasts are
generated for future months. Forecast accuracy is calculated as per the DPC
methodology, and the results are compared with those of the existing system.
The model shows a significant increase in forecast accuracy.
A model for demand forecasting at the regional level is also built, but it could be
neither analyzed nor validated owing to time constraints.
It is recommended that the model be implemented for demand forecasting, with
personnel intervention and the addition of due insights.
Torrent Group: Overview
It all began with the inspired efforts of one enterprising individual, Shri U N Mehta, when
he ventured on his own to create history in the Indian pharmaceutical industry by
successfully implementing the concept of niche marketing. With the launch of
Trinicalm Plus, an effective tranquilizer, the foundation of the company was laid
as ‘Trinity Laboratories’, which was later renamed ‘Torrent’. Today Torrent is one
of the leading pharmaceutical companies of India. Torrent is a multifaceted and
dynamic group dedicated to transforming life by serving two of its most critical
needs: healthcare and energy.
In the power sector, the Torrent Group remains the most experienced private
sector player in the state of Gujarat. Torrent has just launched a mega project, the 1100
MW SUGEN CCPP, which is being set up at an investment of Rs. 3096 crores. It is a
backward integration move by Torrent Power to secure a reliable source of supply for its
Ahmedabad and Surat distribution areas.
The project is strategically located: it is close to the River Tapi, National Highway No. 8,
and gas supply infrastructure comprising LNG terminals and main gas trunk lines. The
plant would comprise 3 advanced-class gas turbines with a high operating efficiency.
The environmental and social impact of this project is minimal owing to the use of
eco-friendly natural gas.
The flagship company of Torrent group, Torrent Pharmaceuticals Limited, is a dominant
player in the therapeutic areas of cardiovascular (CV) and central nervous system (CNS)
and has achieved significant presence in gastro-intestinal, diabetology, anti-infective and
pain management segments.
To cater to new niche segments and sharpen its focus among customers, Torrent Pharma
has 11 marketing divisions, each catering to a defined therapeutic segment. Torrent
Pharma’s competitive advantage as a manufacturer stems from its world-class
manufacturing facilities. Its manufacturing facilities at Indrad, Gujarat, comply with
USFDA, WHO, cGMP, MHRA and TGA norms and have received ISO 9001, ISO 14001,
OHSAS 18001 (Occupational Health and Safety Management System) and ISO/IEC 17025
certifications.
To cater to its growth requirements, Torrent Pharma commissioned a new state-of-the-art
formulations manufacturing facility at Baddi, Himachal Pradesh, in November 2005.
The facility has the capacity to manufacture 3600 million tablets, 400 million capsules and 18
million oral liquid bottles per annum and would cater to the domestic formulations
requirement.
Torrent has a modern and well-equipped state-of-the-art R&D Centre, built with an
investment of US $ 40 million. It is manned by more than 525 highly qualified scientists, with
a combined experience of over 2500 scientific man-years in Drug Discovery and
Development. Torrent Pharma has earmarked 9% of sales year-after-year for R&D
advancement.
In the International operations arena, Torrent Pharma exports to more than 50 countries
around the world with over 1000 product registrations. The international business has been
broadly divided into five zones- USA, Latin America, Russia and CIS, Western Europe and
CEE and Rest of the World (ROW). For its export excellence in International Business,
Torrent Pharma has won several prestigious export awards.
Torrent Pharma is now gearing up to enter the advanced highly regulated international
markets. Torrent Pharma has incorporated Zao Torrent Pharma in Russia, Torrent Do Brasil
Ltda in Brazil, Torrent Pharma GmbH in Germany, Torrent Pharma Inc. in USA and Torrent
Pharma Philippines Inc. in Philippines. These wholly owned subsidiaries will become a
springboard for entry into several regulated and less regulated international markets.
TORRENT PHARMACEUTICALS LIMITED
Mission
We commit ourselves to total customer care by delivering world-class products and
services.
Vision
To be the leader in the pharmaceutical industry.
Values
A set of core values continues to guide us through the process of transforming the
conglomerate into a high-performing and caring organization for our customers,
employees, shareholders and society.
· Improving quality of life of our customers, as we believe quality is a way of life.
· Creating value for our shareholders, for the trust bestowed on us.
· Building an empowered and ethical Torrent family, as the foundation for a
bright future.
· Responsibility towards the society and environment, as we owe our existence
to them.
· Being innovative in solutions, for being different counts.
· Striving for excellence in whatever we do, to follow the exclusive path to
leadership.
· Flexibility and speed shall be our oars for navigating the turbulent seas.
Objective of Project
To develop a statistical model of demand forecasting for the domestic trade
market operations of Torrent Pharmaceuticals Ltd.
· At gross level
· At regional level
Salient Features
The salient features of the project are as follows:
· It aims to improve the existing demand forecasting process by using a
statistical tool.
· It tries to cover all the quantitative and qualitative factors which affect
actual sales.
· It takes uncertain fluctuations into consideration and captures them.
· It discusses product-specific sales behavior.
· It makes the whole forecasting procedure a dynamic one.
· It gives a clear picture of the pharma sector, from the drug-specific to the
macro level.
Project Constraints
· The project is based on tertiary sales made available from ORG-Marg data,
which may be inaccurate to some extent.
· The project does not include secondary sales data.
· The project considers only those parameters for which data are available.
· The project has to estimate the future values of the parameters.
Assumptions Made during the project
· The data collected are accurate.
· Future estimates of the parameters are true.
· The parameters taken into consideration are least correlated with one another.
· Data collection horizon ranges from May’05 to June’05
Project overview
· Studying how demand planning works at the SCM dept.
· Developing a statistical tool for demand forecasting
· Identifying and defining the parameters
· Applying various statistical tools
· Matching the output of the model with past sales
· Comparing the existing system with the suggested system
· Checking the robustness of the tool
· Implementation for all or some of the SKUs, if all the criteria are fulfilled
Benefits expected
· Minimization of overstocking
· Reducing the gap between orders and actual sales
· No opportunity loss, which may result in growth
· Better inventory control and hence better cash flow
· Better utilization of resources
· Dispatch efficiency
· Smooth operation flow from demand planning to order execution
· Prior planning of recruitment and changes in workforce
· Proper allocation of promotional budget
DEMAND FORECASTING- a brief overview
Introduction
What does the word forecast mean?
The word “fore” means ‘watch out’ in golf: it is shouted as a warning to anyone who
could be in the path of a misplaced golf ball. To an angler, the word “cast” means
‘throw out’. Putting the two words together gives “forecast”, that is, ‘watch out and
throw out’.
Forecast: an estimate of future demand. A forecast can be determined by
mathematical means using historical data, created subjectively using estimates
from informal sources, or obtained by combining both techniques.
Forecasting involves making projections about future performance on the basis of
historical and current data.
Forecast methods can be divided into history-based and future-based ones.
History-based demand forecasts are analytic methods based on consumption
statistics. They can be further divided into mathematical and graphic methods.
Future-based demand forecasts use already existing information about future
demand, e.g. offers, confirmed orders in a contracting phase and interviews on
customer behavior (Schönsleben, 1998). In this study, conditional variance
models are used for quantifying the demand process uncertainty. The uncertainty
can, for example, depend on the level of demand, the previous shocks and
the historic level of the variance process.
Understanding customer demand is key for any manufacturer to make and keep
sufficient inventory so that customer orders can be met correctly. The discipline that
helps a supply chain forecast and plan well is called demand planning.
Accurate and timely demand plans are a vital component of an effective supply chain.
Inaccurate demand forecasts typically result in supply imbalances. Although
revenue forecast accuracy is important for corporate planning, forecast accuracy at the
SKU level is critical for proper allocation of resources.
Types of forecasting
Quantitative forecasting is used when sufficient quantitative information is
available.
Qualitative forecasting is used when little quantitative information is available, but
sufficient qualitative knowledge exists.
Quantitative forecasting can be applied when three conditions exist:
1. Information about the past is available
2. This information can be quantified in the form of numerical data.
3. It can be assumed that some aspects of the past pattern will continue into
the future.
Under quantitative forecasting methods, there are two major types of forecasting
models:
· Explanatory models
· Time series forecasting
Explanatory models assume that the variable to be forecasted exhibits an explanatory
relationship with one or more independent variables.
Time series forecasting deals with the past data only. It makes no attempt to discover
the factors affecting its behavior. The objective of time series forecasting methods is to
discover the pattern in the historical data series and extrapolate that pattern into the
future.
Forecast Management
Forecast management is the process of making, checking, correcting and using
forecasts. It also includes determination of the forecast horizon.
While designing a forecasting system, the policy issues of what to forecast, why
forecast is needed, and who does the forecasting must be addressed. A forecast is
meaningful only in relation to planning and decision making in some area of business
application. Thus, an important aspect of any forecasting system is knowing and
planning how it will be used in business planning, budgeting, and the operations
aspects of master scheduling and inventory planning. Different attributes of the
forecasting system are of varying levels of concern and interest to people in each of
these areas.
The basic steps in a forecasting task
Forecasting with quantitative data is a sequential process of five steps:
Step 1: Problem definition
Step 2: Gathering information
Step 3: Preliminary (exploratory) analysis
Step 4: Choosing and fitting models
Step 5: Using and evaluating a forecasting model
Problem definition
Defining the problem involves developing a deep understanding of how the
forecasts will be used, who requires them, and how the forecasting function fits
within the organization. It is worth spending time talking to everyone who will be involved
in collecting data, maintaining databases, and using the forecasts for future planning.
A forecaster has a great deal of work to do to define the forecasting problem properly
before any answers can be provided. One needs to know exactly what products are
stored, who uses them, how long it takes to produce each item, what level of unsatisfied
demand the company is prepared to bear, and so on.
Gathering information
The information available can be mainly of two types:
1. Statistical data
2. The accumulated judgment and expertise of key personnel
Exploratory analysis
Exploratory analysis starts by calculating simple statistics such as the mean, standard
deviation, correlation, minimum, maximum and percentiles for each set of data. Where
more than one series of historical data is available, descriptive statistics can be used
to explore each of them.
The purpose of doing this at this stage is to get a feel for the data. Do they follow
consistent patterns? Is there evidence of the presence of business cycles? Are there
any outliers in the data that need to be explained by those with expert knowledge? How
strong are the relationships among the variables available for analysis?
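The statistics listed above can be sketched in a few lines of Python. This is only an illustration, not the procedure followed at the DPC: the figures are the first six monthly orders and sales values for Alprax 0.5 tabs reported later in this report, while the function names `describe` and `pearson` and the choice of quartiles as the example percentiles are assumptions made here.

```python
from math import sqrt
from statistics import mean, quantiles, stdev

# June '05 to Nov '05 values for Alprax 0.5 tabs, taken from this report.
orders = [878172, 755927, 795987, 885868, 665098, 749917]
sales = [562903, 499151, 525658, 590579, 443185, 499145]


def describe(series):
    """Return the simple statistics named in the text for one data series."""
    q1, median, q3 = quantiles(series, n=4)  # quartiles as example percentiles
    return {
        "mean": mean(series),
        "std_dev": stdev(series),  # sample standard deviation
        "min": min(series),
        "max": max(series),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / spread


summary = describe(sales)
r = pearson(orders, sales)
print(summary)
print(f"correlation(orders, sales) = {r:.3f}")
```

A correlation close to 1 between orders and sales would suggest that the two series move together, which is the kind of relationship the questions above ask about. Six months is far too short a sample to draw conclusions from.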
Choosing and fitting models
After the exploratory analysis, it becomes clear how the data should be handled: what
pattern and behavior are being observed, and which factors affect actual sales.
This is the stage at which the model to be fitted is chosen. One interprets the
characteristics of the actual past data and matches the assumptions of the candidate
models against them. Once a model is chosen, it is fitted to the data and, if
necessary, modified accordingly.
Using and evaluating a forecasting model
After fitting the model to the actual data, inferences can be drawn and forecasts can
be generated for future periods. The model should be checked by holding back one
month of actual data, forecasting that month, and comparing the forecast with the
held-back actual. Forecast effectiveness (forecast accuracy) should then be
calculated. If it is better than that of the present system, the model should be used.
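The hold-out check described above might look as follows in Python. This is a sketch under stated assumptions: the three-month moving average is only a placeholder for whichever model is being evaluated (it is not the model developed in this report), the helper names are hypothetical, and the sales and demand figures are the Alprax 0.5 tabs values reported later in this document.

```python
def pct_deviation(actual, forecast):
    """Deviation of actual sales from a forecast, as a fraction of the forecast."""
    return (actual - forecast) / forecast


def moving_average_forecast(history, window=3):
    """Placeholder model (an assumption): mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


# Alprax 0.5 tabs monthly sales, Dec '06 to May '07, as reported in this document.
sales = [491120, 527857, 533434, 420533, 668147, 580078]
existing_plan_may07 = 525000  # demand planned by the existing system for May '07

# Hold back the latest month and forecast it from the earlier months only.
history, held_out = sales[:-1], sales[-1]
candidate = moving_average_forecast(history)

candidate_dev = pct_deviation(held_out, candidate)
existing_dev = pct_deviation(held_out, existing_plan_may07)
use_candidate = abs(candidate_dev) < abs(existing_dev)

print(f"candidate: {candidate_dev:+.2%}, existing: {existing_dev:+.2%}, "
      f"adopt candidate: {use_candidate}")
```

A single held-back month says little by itself. In practice the comparison would be repeated over several months and SKUs before deciding whether a model is better than the present system.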
SCM @ TORRENT
Supply Chain Management (SCM) is the management of the entire value-added chain,
from the supplier and the manufacturer right through to the retailer and the final
customer. It coordinates almost all the departments of the company, linking them and
smoothing the working of the whole system.
SCM has three primary goals:
· Reduce inventory,
· Increase the transaction speed by exchanging data in real-time,
· Increase sales by implementing customer requirements more efficiently.
Planning done at SCM is the indicator for all the other departments, i.e. production,
finance, marketing and HR.
Torrent’s supply chain management is divided into two divisions:
· Domestic operations division
· International operations division
The domestic operations division is further divided into:
· C&FA Cell
· Demand Planning Cell
· Indrad Warehouse
· Zirakpur Warehouse
· PPC - Indrad & Baddi
The supply chain management department is well equipped with the necessary
infrastructure, including modern software and hardware.
To handle the company’s large multipoint network across India and the rest of the
world, SAP is implemented at the TORRENT SCM dept. The MM module (Materials
Management module) and the PP module (Production Planning module) are used by the
department personnel.
Microsoft Excel is used extensively in the department. Various MIS reports are
prepared in MS Excel from SAP data.
Company network
Torrent’s corporate office is based at Ahmedabad. All the supporting activities are
conducted from the HO (based at Ahmedabad).
Two plants are situated at Indrad (Gujarat) and Baddi (Himachal Pradesh).
Most of the domestic requirement is served by the Baddi plant. The company has set up
its warehouse at Zirakpur (Punjab). Products produced at the Baddi PPC are stored at the
Zirakpur warehouse, and all dispatches are made from the warehouse.
The company has 25 Carrying and Forwarding Agents (C&FAs) across India.
C&FAs are responsible for primary sales in their allocated regions.
C&FAs are the agents which sell the products to the stockists. They collect orders from
the stockists and pass them on to the supply chain department at HO.
All these activities are, again, coordinated by the Supply Chain Department.
The company engages mainly in two types of selling:
· Trade sales
· Institutional sales
Trade sales are sales made through the C&FA channel, while institutional sales are
sales to institutions such as hospitals, railways and the army.
SCM@HO
[Figure: Company network. Demand planning drives the Baddi and Indrad PPCs, whose
output goes to the Indrad and Zirakpur warehouses. Stock is transferred from the
warehouses to the C&FAs, with inter-C&FA transfers. C&FAs sell to stockists and
institutions (primary sales), stockists sell to retailers (secondary sales), and
retailers sell to customers (tertiary sales).]
Products
The company has a product basket of 500+ products.
The company produces products in the form of tablets, capsules, liquids and injections.
Each product is allocated a unique 7-digit product code.
From the marketing point of view, there are 11 divisions, and the drugs are
allocated to them accordingly:
Sensa, Mind, Axon, Neuron, Azuca, Psycan, Omega, Delta, Prima, Vista, Alfa
These divisions are in turn classified into three groups: PVA, APOD and SMAN, where
PVA = Prima, Vista, Alpha (Anti-infective segment)
APOD = Azuca, Psycan, Omega, Delta (Cardiology and Diabetology)
SMAN = Sensa, Mind, Axon, Neuron (Central Nervous System)
Product Classification
These products show different behaviors in selling quantity, and should accordingly be
classified as:
· Matured (stable) products
· Seasonal products
· New products
Matured products are products which have been in the market for a long time.
They follow a stable pattern without significant deviation, so their fluctuations
are well understood.
E.g. Nikoran 5, Deplatt tab, Antidep
Seasonal products are products which show seasonal behavior: sales rise in a
particular season, i.e. in particular months. With data for more than one cycle
(i.e. more than a year), the size of the rise due to the season can also be estimated.
E.g. Quintor Infusion has shown high sales in the months of April and May.
New products are products launched within the last 6 months. Their behavior is not
easy to estimate: with so little data, one cannot capture the trend or the amount of
deviation, and hence the fluctuation in the selling quantity.
E.g. Rimofit and Rimoslim were launched in May’07.
Demand Planning @SCM Dept.
Demand planning is the process through which an organization generates a forecast of
market demand for its products on a regular basis. This allows the organization to
calculate a historically based statistical forecast for each point (that is, each part
number/warehouse combination). Some key output variables include demand in pieces,
demand in customer orders, pieces per customer order, standard (forecast) deviation,
and pieces per deviation.
At Torrent, a separate demand planning cell under the SCM dept. conducts demand
planning on the basis of a 4-month rolling plan. Under the rolling plan, planning is
done 4 months prior to the corresponding month. Planning includes demand
forecasting, production planning, supply planning and dispatch planning.
The demand plan is first given by the marketing department and is then reviewed by
the demand planning cell. For every product in each division, the demand plan is
reviewed and corrected if needed.
On the 20th day of every month M, the demand planning cell prepares the demand plan
for month M+3.
Once the demand plan is decided, it is executed by the related departments of the
company in a sequential and structured manner. All other planning, such as
production planning, financial planning, dispatch planning and procurement planning,
is made accordingly.
The company produces most of its products at in-house facilities, i.e. at the Indrad
and Baddi plants. For certain products, the company has P2P and LLM arrangements.
P2P is a principal-to-principal arrangement, in which products made by other
companies are marketed by TORRENT. The drug license and manufacturing license must
be held by the manufacturing company; Torrent need not hold them. Approximately 230
products are received through P2P.
LLM is Loan License Manufacturing, in which TORRENT uses the plant and facilities of
other companies. Torrent must hold the drug license and manufacturing license for the
particular drug. Approximately 38 products are received through LLM.
Forecasting for products which are not produced in-house is a very complex task, as
they have a longer lead time than products produced in-house.
The demand planning cell deals with these arrangements. It is responsible for getting
the products in time and for planning their demand and dispatches at the right point of
time and at the least cost.
Due to certain circumstances, it is not possible to execute all the orders received
from the stockists. There are situations in which the stock cannot be connected
(supplied) as per the order.
This generally happens because of non-availability of raw material, machine
breakdown, transportation problems, or sudden excess demand.
In some cases, it is known in advance that a particular product may not be
available in the coming month. Such a product is declared a Non-Available Product,
abbreviated as NAP. These orders could have been genuine sales had demand been
planned properly.
At the beginning of every month, every aspect of the past month is analyzed and
each is given proper justification. Reports such as the Gap report, NAP report,
inventory analysis report and connectivity report are prepared.
Planning Horizon: 4-month rolling plan
Month     July          August     September     October
Index     M             M+1        M+2           M+3
Status    Solid rock    Solid      Slushy        Liquid
(The Slushy and Liquid months form the tentative plan.)
Consider the month of June’07 as the reference point. According to the 4-month
rolling plan, demand planning in June is made for the coming 4 months. As
it is a continuous process, a new month is added every month and the status of the
subsequent months changes.
The next two consecutive months are considered frozen. That is, in month ‘M-1’ the
demand plan is fixed for the next two months, ‘M’ and ‘M+1’. A plan in Solid rock or
Solid status cannot be changed.
The planning for the 3rd and 4th months is tentative; their status is Slushy and
Liquid. In the tentative plan, demand can be changed as per the constraints.
In the same manner, months ‘M’ and ‘M+1’ were themselves tentative in ‘M-3’.
The status of every month changes on arrival of a new month.
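The way the statuses roll forward each month can be sketched in Python. This is an illustration of the rule as described above, not a tool used at the DPC; the function names `add_months` and `rolling_plan` are hypothetical.

```python
from datetime import date

# Status labels for the four planned months, in order M, M+1, M+2, M+3.
STATUSES = ["Solid rock", "Solid", "Slushy", "Liquid"]
FROZEN = {"Solid rock", "Solid"}  # frozen months; the other two are tentative


def add_months(d, n):
    """Return the first day of the month that is n months after d."""
    index = d.year * 12 + (d.month - 1) + n
    return date(index // 12, index % 12 + 1, 1)


def rolling_plan(planning_month):
    """Map the planning month (M-1) to the four months it plans and their status."""
    plan = []
    for offset, status in enumerate(STATUSES, start=1):
        plan.append((add_months(planning_month, offset), status, status in FROZEN))
    return plan


# Reference point used in the text: planning done in June '07.
for month, status, frozen in rolling_plan(date(2007, 6, 1)):
    label = "frozen" if frozen else "tentative"
    print(f"{month.year}-{month.month:02d}  {status:<11} {label}")
```

Calling `rolling_plan` with the following month shifts every status one step: the month that was Liquid becomes Slushy, the Slushy month becomes Solid, and so on.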
Existing system of forecasting
Torrent’s domestic operations work on a make-to-stock basis: products are
manufactured before orders are received at the C&FAs. Hence there is a need to
forecast sales in advance, so that the optimum quantity is produced to serve the market.
At Torrent, the existing system of forecasting does not use a specific statistical tool.
Forecasting is based on past data; statistics such as average, minimum and maximum
sales, and orders, are taken into consideration.
A division-wise demand plan is prepared by the marketing department on the basis of
field targets. This plan is reviewed by the demand planning cell, and the demand plan
is finalized according to the schedule of the rolling plan.
Demands (forecasts) are generally predicted on the basis of past data. The behavior
of recent months, along with the general trend, is considered. Field targets given to
the sales force are also taken into consideration. These are the quantitative data.
Certain factors such as epidemics, seasonal effects and some visible factors are also
taken care of. Visible factors include competitors’ moves, market behavior and
regulatory factors. These factors are qualitative data, which must be quantified in
some manner.
Forecasts are put forward considering all these factors.
The present system relies more on judgment; no particular statistical tool is applied.
As a result, it has not been able to capture all these factors precisely. Fluctuations
cannot be quantified in the proper proportion, and there may be a bias in the
estimation and quantification of these parameters.
All this results in forecasts which do not match actual sales.
Poor forecast accuracy results in:
· Dispatch inefficiencies
· Loss of genuine sales
· High inventory, and hence blockage of working capital
· High lead time
Good forecast accuracy is a must. Forecast accuracy at present is low and needs to
be improved.
Hence, there is a need to develop a system (model) which takes care of all the
concerned factors. All the factors need to be understood and quantified properly,
including how a single factor affects different SKUs in different ways.
The demand planning cell prepares a file named CODIS, which stands for Correlation
among Orders, Demand, Inventory and Sales. For every product, SAP provides data
giving the demand, orders received, sales and total availability.
This file is used to analyze the actual scenario, i.e. to what extent orders are executed.
The % variation between demand and sales and between demand and orders is calculated.
This shows how close the demand is to, or how far it is from, actual sales and orders.
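The two variation figures can be reproduced with a short Python sketch. It assumes that the variation is measured relative to demand, i.e. (actual - demand) / demand, which is consistent with the deviation formula given later in this report. The input values are the first three months of the Alprax 0.5 tabs data reported in this document; the function name `pct_variation` is hypothetical.

```python
def pct_variation(actual, demand):
    """Percentage variation of an actual figure (sales or orders) from demand."""
    return (actual - demand) / demand * 100


# Alprax 0.5 tabs, June '05 to Aug '05, as reported in this document.
months = ["June '05", "July '05", "Aug '05"]
demand = [950010, 900120, 850000]
orders = [878172, 755927, 795987]
sales = [562903, 499151, 525658]

rows = [
    (m, pct_variation(s, d), pct_variation(o, d))
    for m, d, o, s in zip(months, demand, orders, sales)
]

for month, sales_var, orders_var in rows:
    print(f"{month:<10} sales vs demand {sales_var:7.2f}%   "
          f"orders vs demand {orders_var:7.2f}%")
```

A negative value means that sales (or orders) fell short of the planned demand for that month.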
Two graphs follow. The first shows the status of orders, demand, sales and stock. The
second shows the % variation of demand to sales and of demand to orders, with the
corresponding trend lines.
Both graphs are for the product Alprax 0.5 tabs, which contains the molecule
Alprazolam, belonging to the class of tranquilizers.
Forecast Accuracy Jun'05 - May'07
Quantity (Units)
Month        Demand     Orders     Sales      TA @ CFA
June '05     950010     878172     562903     1431263
July '05     900120     755927     499151     1449613
Aug '05      850000     795987     525658     1413207
Sept '05     860000     885868     590579     1364700
Oct '05      772000     665098     443185     1322452
Nov '05      855000     749917     499145     1577114
Dec '05      820000     773628     515552     1412617
Jan '06      730000     691974     459476     1138279
Feb '06      750000     616661     411027     933865
March '06    700000     686613     432582     1319114
April '06    850000     940231     613461     1727856
May '06      800000     913461     604213     1282504
June '06     900000     962695     621979     1235664
July '06     950000     839190     557786     1275416
Aug '06      900000     855761     568361     1552737
Sept '06     900130     741450     476396     767560
Oct '06      670000     490813     486713     687486
Nov '06      600130     538989     535429     387987
Dec '06      500130     492931     491120     792249
Jan '07      500000     534135     527857     261937
Feb '07      450000     558444     533434     667944
Mar '07      375000     428911     420533     610092
Apr '07      525000     693125     668147     893657
May '07      525000     607044     580078     904710
June '07     550000     596129     574958     734085
22. Tracking Forecast Accuracy
(Chart: Series1 and Series2 by month, each with a linear trend line; vertical axis: % Variation; horizontal axis: Month.)
Month       Series1    Series2
June '05    -40.75%    -7.56%
July '05    -44.55%    -16.02%
Aug '05     -38.16%    -6.35%
Sept '05    -31.33%    3.01%
Oct '05     -42.59%    -13.85%
Nov '05     -41.62%    -12.29%
Dec '05     -37.13%    -5.66%
Jan '06     -37.06%    -5.21%
Feb '06     -45.20%    -17.78%
March '06   -38.20%    -1.91%
April '06   -27.83%    10.62%
May '06     -24.47%    14.18%
June '06    -30.89%    6.97%
July '06    -41.29%    -11.66%
Aug '06     -36.85%    -4.92%
Sept '06    -47.07%    -17.63%
Oct '06     -27.36%    -26.74%
Nov '06     -10.78%    -10.19%
Dec '06     -1.80%     -1.44%
Jan '07     5.57%      6.83%
Feb '07     18.54%     24.10%
Mar '07     12.14%     14.38%
Apr '07     27.27%     32.02%
May '07     10.49%     15.63%
The graph above shows the % variation of sales and orders relative to demand.
23. Forecast accuracy at TORRENT with the present system
A forecast is said to be accurate if:
Sales = 90% to 110% of the forecast
At the beginning of each month, the demand planning cell at TORRENT calculates the
forecast accuracy for the previous month, both at the gross level and at the C&FA level.
The actual sales of the last month are compared with the projected demand for the
corresponding product and month, and the deviation of actual sales from the demand is
calculated.
Consider a product ‘X’ whose actual sales are ‘Yt’ and whose forecast is ‘Ft’.
The deviation is then calculated by the formula (Yt-Ft)/Ft, which gives the %
deviation of demand to sales.
At TORRENT, a range is defined to specify forecast accuracy.
A forecast is considered a HIT if the deviation lies within the +/-10% range;
otherwise it is a MISS.
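The HIT/MISS rule can be illustrated with a minimal Python sketch. The function names and the sample figures below are hypothetical and chosen only for illustration; the report itself performs this calculation in MS Excel.

```python
def pct_deviation(actual, forecast):
    """Deviation of actual sales Yt from forecast Ft, i.e. (Yt - Ft) / Ft."""
    return (actual - forecast) / forecast


def is_hit(actual, forecast, tolerance=0.10):
    """A forecast is a HIT when the deviation lies within +/- tolerance."""
    return abs(pct_deviation(actual, forecast)) <= tolerance


def hit_rate(actuals, forecasts, tolerance=0.10):
    """Share of periods (or products) whose forecast was a HIT."""
    hits = [is_hit(y, f, tolerance) for y, f in zip(actuals, forecasts)]
    return sum(hits) / len(hits)


# Illustrative numbers only (not Torrent data).
sales = [95.0, 120.0, 108.0, 70.0]
forecast = [100.0, 100.0, 100.0, 100.0]
for y, f in zip(sales, forecast):
    label = "HIT" if is_hit(y, f) else "MISS"
    print(f"sales={y:.0f} forecast={f:.0f} deviation={pct_deviation(y, f):+.0%} {label}")
print("hit rate:", hit_rate(sales, forecast))
```

The tolerance is kept as a parameter so that a different accuracy band could be tried without changing the rule itself.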
MS Excel is used for this purpose. The present system has produced less accurate
results, so there is a need to work on the demand forecasting.
The present system is efficient for stable, fast-moving and mature products.
It takes care of products that show high skewness due to promotions and schemes.
It can estimate well the sales of products that are about to be launched.
The demand planning cell also interacts with the marketing people about product
behavior on line extension, and it does a good job of estimating the fluctuations
caused by forthcoming events that are known in advance.
The present system has its own unique features.
The system is well defined and well designed. It is a foolproof system.
Defining parameters
Parameters are the factors that directly or indirectly affect the actual sales.
These factors need to be identified: which types of factors affect the actual sales?
Factors with a direct effect and those with an indirect effect should both be explored.
Exploration yields a list of candidate parameters, which must then be sorted to find
the parameters that have a significant impact on sales. Statistical methods exist to
check the significance of the various parameters for the actual sales.
The following factors can affect the actual sales:
24. Trend
Total availability of the SKU
Seasonal factors i.e. for months
Promotions and schemes
Price sensitivity
Market share
Market growth of the SKU
Market growth of the molecule
Market growth of the brand
Market growth of the molecule class
Market growth expected by the organizations
Additional duties, taxes levied by government
Introduction of new drugs by competitor in the same segment
Line extension by company
Introduction of new drugs by company in the same segment
Regional factor
Drugs with Same molecule
Same Products with the different power
Institutional sales
Sales force
Field Targets
Secondary sales
No. of stockists
Tertiary sales
No. of retailers
Miscellaneous factors (Epidemic, Billing channel, Government factors,
availability, orders etc…)
The factors stated above can have a significant impact on the actual sales.
The parameters must be selected properly to obtain accurate and close forecasts.
Matching the results
After an appropriate model is developed, it should be applied to the past data.
Forecasts are generated for the past periods and compared with the actual past
data to verify the reliability and validity of the model. Various other statistical tools
can be used for the same purpose.
Comparing it with the present system
After the models are developed, they must be compared with the present system to see
whether they give better results. The comparison should cover several aspects.
Does the model give reliable and consistent results? Does it have an impact
on the inventory level? Does it have an impact on profitability? Can it make the whole
system smoother?
25. Is it Robust?
A model should give accurate results in any situation. If it gives a proper forecast in
any situation, it should be implemented. The model should capture fluctuations and
react to adjustments made in anticipation of certain factors. The model has to be
robust: flexible towards changes and able to react to them accordingly.
Implementation
After inspecting all the criteria, one should validate the model. If it gives reliable,
consistent and precise results and has a significant impact on the topics of concern,
it should be implemented and used for the future.
Statistical tool
Tools which can be considered are
Time series
Exponential smoothing
Multiple regressions
Many forecasting methods are based on the concept that when an underlying pattern
exists in a data series, that pattern can be distinguished from randomness by smoothing
(averaging) past values. The effect of smoothing is to eliminate randomness so the
pattern can be broken down into sub patterns that identify each component of the time
series separately. Such a breakdown can frequently aid in better understanding the
behavior of the series, which facilitates improved accuracy in forecasting.
Time series analysis decomposes the data into sub-patterns. It analyzes the data and
separates the effects of the components.
Data = pattern + error
= f (trend-cycle, seasonality, error)
But here at Torrent, the product basket has 500+ products, each with its own
behavior, and several factors affect the overall picture. Time series decomposition
alone is not enough, as it captures only the trend, seasonality and error.
To analyze and determine the trend, seasonality and level followed by the data,
Triple Exponential Smoothing is applied. On the basis of the assumptions and the
methodology of the model, one fits the model to the past data and obtains the
forecasts for the coming period from it.
Data Availability
Monthly primary sales data are available for the past 24 months, i.e. from June’05 to
May’07. The data cover two complete cycles, which is the minimum required for
applying triple exponential smoothing. Primary sales are the sales made through the
CFA channels. They also include the institutional sales, which are to be netted out
later. Data for institutional sales are obtained from SAP as a dump for the same
period as stated above.
26. Exponential Smoothing
Exponential smoothing is an extension of the moving average method and uses a
weighted moving average. Weights are allocated to the past data and the recent
data so that they decrease exponentially as the observations get older. Recent
values are therefore given relatively more weight in forecasting than the older
observations.
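A minimal sketch of simple (single) exponential smoothing shows how the weights fall off with age. The function names are hypothetical, and starting the smoothed series at the first observation is an assumption; other starting values are possible.

```python
def exponential_smoothing(series, alpha):
    """Smooth a series: each value is alpha * new observation + (1 - alpha) * previous smoothed value.

    Assumption: the smoothed series starts at the first observation.
    """
    smoothed = [series[0]]
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed


def implied_weights(alpha, n):
    """Weights given to the latest observation, the one before it, and so on."""
    return [alpha * (1 - alpha) ** k for k in range(n)]


print(exponential_smoothing([10, 20, 30], alpha=0.5))
print(implied_weights(alpha=0.5, n=4))
```

With a smoothing constant of 0.5 each older observation receives half the weight of the one after it, which is the exponentially decreasing pattern described above.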
Triple Exponential Smoothing (Holt Winters multiplicative model)
Holt’s method of exponential smoothing was extended by Winters (1960) to capture
seasonality.
It considers (1) Deseasonalized level
(2) Trend (Growth) level
(3) Seasonality
Let us denote the original data, i.e. monthly sales, as Yt, with
Deseasonalized factor Rt
Trend factor (Growth factor) Gt
Seasonal factor St
Forecast Ft
As monthly data are available for 24 months, from June’05 to May’07, we have two
complete cycles. These data appear in the 3rd column of the table given on the next
page.
To get the level and trend, one applies linear regression. In the linear regression
equation,
Y = a + bX; Y = actual sales
a = intercept (Rt)
b = growth (Gt)
After the deseasonalized level and the growth factor are obtained, the seasonal factor is calculated:
Seasonal factor = (Actual sales of the corresponding month) / (Sales forecasted for the same month by linear regression)
If the seasonal factor is greater than 1, sales are higher than the trend by that
proportion because of the season. If it is less than 1, sales are lower than the trend
by that proportion because of the season.
Equations for the Holt-Winters’ method are as follows:
Level: $R_t = \alpha \, \frac{Y_t}{S_{t-s}} + (1-\alpha)(R_{t-1} + G_{t-1})$
Trend: $G_t = \beta (R_t - R_{t-1}) + (1-\beta) G_{t-1}$
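27. Seasonal: $S_t = \gamma \, \frac{Y_t}{R_t} + (1-\gamma) S_{t-s}$
Forecast: $F_{t+X} = (R_t + G_t X) \, S_{t-s+X}$
Here s is the length of the seasonal cycle (12 for monthly data), X is the number of
periods ahead being forecast, and α, β, γ are the smoothing constants, with 0 < α, β, γ < 1.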
These values are chosen by the forecaster according to the data, so the choice of
the smoothing constants can introduce a bias. It has been observed
that α = β = γ = 0.5 gives favorable results.
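The level, trend, seasonal and forecast recursions of the multiplicative Holt-Winters method can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used in the report: the function name is hypothetical, and the initial level, trend and seasonal factors are taken from the averages of the first two cycles, which is only one of several common initializations.

```python
def holt_winters_multiplicative(y, s, alpha, beta, gamma, horizon):
    """Forecast `horizon` periods ahead with multiplicative Holt-Winters.

    y: observed series, s: cycle length (12 for monthly data),
    alpha/beta/gamma: smoothing constants for level, trend and seasonality.
    """
    if len(y) < 2 * s:
        raise ValueError("need at least two complete seasonal cycles")

    # Assumed initialization: cycle averages give the starting level and trend,
    # and the first cycle relative to its average gives the seasonal factors.
    first = sum(y[:s]) / s
    second = sum(y[s:2 * s]) / s
    level = first
    trend = (second - first) / s
    seasonal = [value / first for value in y[:s]]

    for t in range(s, len(y)):
        prev_level = level
        level = alpha * y[t] / seasonal[t - s] + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal.append(gamma * y[t] / level + (1 - gamma) * seasonal[t - s])

    n = len(y)
    # Reuse the most recent full cycle of seasonal factors for future periods.
    return [(level + trend * m) * seasonal[n - s + (m - 1) % s]
            for m in range(1, horizon + 1)]


# Illustrative series: three cycles of length 2 with no trend.
print(holt_winters_multiplicative([100.0, 200.0] * 3, s=2,
                                  alpha=0.5, beta=0.5, gamma=0.5, horizon=3))
```

For monthly sales the call would use s=12 and a 24-month history, which is exactly the two-cycle minimum the function checks for.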
To remove this initialization bias, the method is modified so that it gives the same
results as the above calculation.
The modified method is as follows:
Rather than smoothing the trend, level and seasonal factors with the above
equations, one fixes the trend and the level factor at the values obtained by
linear regression. They are held constant for every month, i.e. for the past months as
well as the coming months.
For the seasonal indices of the future months, one takes the average of the indices of
the corresponding months in the past cycles.
This makes the calculations easy when all the smoothing constants equal 0.5.
The forecasts for the two drugs Nikoran 5 Mg tab and Torleva 500 are given below.
The last column indicates the % variation between the forecast and the actual sales.
For the past months, the variation is very small, i.e. within +/-10%.
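The modified method described above can be sketched as follows: a straight line is fitted to the whole history, each month's seasonal factor is the actual value divided by the fitted value, and the factors of matching months are averaged for the forecast. The function names and the short sample series are hypothetical, and months are numbered 1, 2, 3, ... as an assumption about the time index.

```python
def fit_line(y):
    """Least-squares line y = a + b*x over x = 1..n; returns (a, b)."""
    n = len(y)
    xs = range(1, n + 1)
    x_mean = sum(xs) / n
    y_mean = sum(y) / n
    sxy = sum((x - x_mean) * (v - y_mean) for x, v in zip(xs, y))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    return y_mean - b * x_mean, b


def seasonal_regression_forecast(y, s, horizon):
    """Forecast with a fixed regression trend and averaged seasonal indices."""
    a, b = fit_line(y)
    n = len(y)
    # Seasonal factor = actual sales / sales fitted by the regression line.
    factors = [y[t] / (a + b * (t + 1)) for t in range(n)]
    # Average the factors of the same position in each past cycle.
    avg = [sum(factors[k::s]) / len(factors[k::s]) for k in range(s)]
    return [(a + b * (n + m)) * avg[(n + m - 1) % s]
            for m in range(1, horizon + 1)]


# Illustrative series with a cycle of length 2 (not Torrent data).
print(seasonal_regression_forecast([100.0, 200.0, 100.0, 200.0], s=2, horizon=2))
```

For the 24-month sales history the call would use s=12, so each future month takes the average of its two past seasonal factors.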
34. Forecast accuracy June'05-May'07 (sales & orders)
(Chart: sales, forecasted demand (sales), orders and forecasted demand (order) by month.)
Month       Sales      Forecasted demand (sales)   Orders     Forecasted demand (order)
June '05    1601814    2159685                     2025287    2477246
July '05    1127094    1150926                     1328359    1292622
Aug '05     754860     1247370                     782084     1315619
Sept '05    780661     486365                      781741     495389
Oct '05     366338     314832                      371538     330171
Nov '05     519499     980747                      526339     1086651
Dec '05     1153666    1007405                     1163220    1051971
Jan '06     93867      232877                      93867      253379
Feb '06     161767     241359                      161863     269189
March '06   207495     273507                      207591     292252
April '06   551717     680936                      556421     751399
May '06     521824     849166                      525592     951116
June '06    1987064    1579151                     2037925    1723485
July '06    851742     834463                      864498     889079
Aug '06     1250256    896345                      1256448    893926
Sept '06    136720     346209                      140196     332245
Oct '06     185576     221874                      191010     218369
Nov '06     1005490    683866                      1073078    708007
Dec '06     593723     694563                      603124     674449
Jan '07     253332     158637                      260148     159645
Feb '07     215842     162316                      232799     166439
Mar '07     225209     181422                      228318     177034
Apr '07     529518     445060                      560602     445103
May '07     756852     546272                      795742     549776
A graph showing the variation of forecast to sales and orders
35. Tracking forecasting accuracy
(Chart: four % variation series by month, with linear trend lines for the two "Vs Demand" series; vertical axis: % variation; horizontal axis: month.)
Month       Sales Vs Demand   Orders Vs Demand   Sales Vs past forecast   Orders Vs past forecast
June '05    -35%              -22%               52%                      62%
July '05    -2%               3%                 -46%                     -24%
Aug '05     -65%              -68%               -59%                     -53%
Sept '05    38%               37%                -28%                     -28%
Oct '05     14%               11%                -118%                    -115%
Nov '05     -89%              -106%              -54%                     -52%
Dec '05     13%               10%                -4%                      -3%
Jan '06     -148%             -170%              -220%                    -220%
Feb '06     -49%              -66%               -24%                     -24%
March '06   -32%              -41%               13%                      13%
April '06   -23%              -35%               0%                       1%
May '06     -63%              -81%               -5%                      -5%
June '06    21%               15%                30%                      31%
July '06    2%                -3%                -41%                     -39%
Aug '06     28%               29%                -12%                     -11%
Sept '06    -153%             -137%              -47%                     -43%
Oct '06     -20%              -14%               -8%                      -5%
Nov '06     32%               34%                10%                      16%
Dec '06     -17%              -12%               -136%                    -132%
Jan '07     37%               39%                41%                      42%
Feb '07     25%               29%                31%                      36%
Mar '07     19%               22%                33%                      34%
Apr '07     16%               21%                6%                       11%
May '07     28%               31%                34%                      37%
For the above SKU, the past forecasts have fluctuated considerably.
But this model rests on some basic assumptions and hence has limitations:
36. It needs data for two complete cycles, but TORRENT has many products that were
launched more recently than that. The method therefore fails for products with less data.
The method concentrates on only 3 parameters, which is very few, as many other
probable factors affect the actual sales. So the method will not be able to
give accurate results.
The method may also contain certain biases, as the constants are initialized by the
forecaster.
So it is not advisable to continue with the triple exponential method for forecasting.
A more robust, flexible and inclusive model needs to be chosen and fitted to the
data.
Need of another method
Another method must be applied, which can include every parameter affecting the
actual sales.
· A method which is adjustable to any change regarding the parameters.
· One which gives very significant results.
· One which gives elaborate explanations about the steps taken.
· The method which gives less error.
· One, which increases the forecast accuracy and effectiveness to the significant
level.
· A new method should be an inclusive one. Later when a new parameter is
identified, it should be able to consider it.
Multiple Regressions with ‘n’ factors
General Purpose
The general purpose of multiple regressions (the term was first used by Pearson, 1908)
is to learn more about the relationship between several independent or predictor
variables and a dependent or criterion variable.
Overview
Multiple regression, a time-honored technique going back to Pearson's 1908 use of it, is
employed to account for (predict) the variance in an interval dependent, based on linear
combinations of interval, dichotomous, or dummy independent variables. Multiple
regression can establish that a set of independent variables explains a proportion of the
variance in a dependent variable at a significant level (through a significance test of R2),
and can establish the relative predictive importance of the independent variables (by
comparing beta weights). Power terms can be added as independent variables to
explore curvilinear effects. Cross-product terms can be added as independent variables
to explore interaction effects. One can test the significance of difference of two R2's to
determine if adding an independent variable to the model helps significantly. Using
hierarchical regression, one can see how most variance in the dependent can be
explained by one or a set of new independent variables, over and above that explained
by an earlier set. Of course, the estimates (b coefficients and constant) can be used to
construct a prediction equation and generate predicted scores on a variable for further
analysis.
The multiple regression equation takes the form y = b1x1 + b2x2 + ... + bnxn + c. The b's
are the regression coefficients, representing the amount the dependent variable y
changes when the corresponding independent changes 1 unit. The c is the constant,
37. where the regression line intercepts the y axis, representing the amount the dependent
y will be when all the independent variables are 0. The standardized version of the b
coefficients is the beta weights, and the ratio of the beta coefficients is the ratio of the
relative predictive power of the independent variables. Associated with multiple
regression is R2, multiple correlation, which is the percent of variance in the dependent
variable, explained collectively by all of the independent variables.
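As an illustration of how the b coefficients and the constant c can be estimated, the sketch below fits the equation by ordinary least squares using the normal equations. It is written in plain Python with hypothetical function names and made-up data; in practice the report relies on the regression tools of MS Excel or SPSS.

```python
def solve(a, b):
    """Solve the linear system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_multiple_regression(rows, y):
    """Return [c, b1, ..., bn] for y = c + b1*x1 + ... + bn*xn (least squares)."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 carries the constant c
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * v for d, v in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up data generated from y = 5 + 2*x1 + 3*x2, so the fit should recover 5, 2, 3.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [5 + 2 * x1 + 3 * x2 for x1, x2 in rows]
print(fit_multiple_regression(rows, y))
```

The normal equations are used here only because they keep the example short; statistical packages typically use more numerically stable decompositions.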
Multiple regression shares all the assumptions of correlation: linearity of relationships,
the same level of relationship throughout the range of the independent variable
("homoscedasticity"), interval or near-interval data, absence of outliers, and data whose
range is not truncated. In addition, it is important that the model being tested is correctly
specified. The exclusion of important causal variables or the inclusion of extraneous
variables can change markedly the beta weights and hence the interpretation of the
importance of the independent variables.
Key Terms and Concepts
The regression equation takes the form
Y = b0 + b1*x1 + b2*x2 + e
where Y is the true dependent,
the b's are the regression coefficients for the corresponding x (independent) terms,
b0 is the constant or intercept (written c above),
e is the error term reflected in the residuals.
Sometimes this is expressed more simply as
y = b0 + b1*x1 + b2*x2
where y is the estimated dependent and
b0 is the constant (which includes the error term).
Equations such as that above, with no interaction effects (see below), are called main
effects models. In MS Excel, select Tools, Data Analysis, Regression. In SPSS, select
Analyze, Regression, Linear; select your dependent and independent variables; click
Statistics; select Estimates, Confidence Intervals, Model Fit; continue; OK.
Predicted values, also called fitted values, are the values of each case based on using
the regression equation for all cases in the analysis. In SPSS, dialog boxes use the
term PRED to refer to predicted values and ZPRED to refer to standardized predicted
values. Click the Save button in SPSS to add and save these as new variables in your
dataset.
Adjusted predicted values are the values of each case based on using the regression
equation for all cases in the analysis except the given case.
Residuals are the difference between the observed values and those predicted by the
regression equation.
Interaction effects are sometimes called moderator effects because the interacting
third variable which changes the relation between two original variables is a moderator
variable which moderates the original relationship. For instance, the relation between
income and conservatism may be moderated depending on the level of education.
The regression coefficient, b, is the average amount the dependent increases when
the independent increases one unit and other independents are held constant. Put
another way, the b coefficient is the slope of the regression line: the larger the b, the
steeper the slope, the more the dependent changes for each unit change in the
independent. The b coefficient is the unstandardized simple regression coefficient for
the case of one independent. When there are two or more independents, the b
38. coefficient is a partial regression coefficient, though it is common simply to call it a
"regression coefficient" also. In SPSS, Analyze, Regression, Linear; click the Statistics
button; make sure Estimates is checked to get the b coefficients (the default).
b coefficients compared to partial correlation coefficients. The b coefficient is a
semi-partial coefficient, in contrast to partial coefficients as found in partial correlation.
The partial coefficient for a given independent variable removes the variance explained
by control variables from both the independent and the dependent, then assesses the
remaining correlation. In contrast, a semi-partial coefficient removes the variance only
from the independent. That is, where semi-partial coefficients look at total variance of the
dependent variable, partial coefficients look at the variance in the dependent after
variance accounted for by control variables is removed. Thus the b coefficients, as
semi-partial coefficients, reflect the unique (independent) contributions of each
independent variable to explaining the total variance in the dependent variable.
Dynamic inference is drawing the interpretation that the dependent changes b units
because the independent changes one unit. That is, one assumes that there is a
change process (a dynamic) which directly relates unit changes in x to b changes in y.
This assumption implies two further assumptions which may or may not be true: (1) b is
stable for all sub samples or the population (cross-unit invariance) and thus is not an
artificial average which is often unrepresentative of particular groups; and (2) b is stable
across time when later re-samples of the population are taken (cross-time invariance).
t-tests are used to assess the significance of individual b coefficients, specifically
testing the null hypothesis that the regression coefficient is zero. A common rule of
thumb is to drop from the equation all variables not significant at the .05 level or better.
Note that restricted variance of the independent variable in the particular sample at
hand can be a cause of a finding of no significance. Like all significance tests, the t-test
assumes randomly sampled data. In SPSS, Analyze, Regression, Linear; click the
Statistics button; make sure Estimates is checked to get t and the significance of b.
Level-importance is the b coefficient times the mean for the corresponding
independent variable. The sum of the level importance contributions for all the
independents, plus the constant, equals the mean of the dependent variable. Achen
(1982: 72) notes that the b coefficient may be conceived as the "potential influence" of
the independent on the dependent, while level importance may be conceived as the
"actual influence." This contrast is based on the idea that the higher the b, the more y
will change for each unit increase in x, but the lower the mean for the given
independent, the fewer actual unit changes will be expected. By taking both the
magnitude of b and the magnitude of the mean value into account, level importance is a
better indicator of expected actual influence of the independent on the dependent. Level
importance is not computed by SPSS.
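Since level importance is simply b times the mean of the corresponding independent, it is easy to compute by hand. The sketch below uses a hypothetical function name and made-up data that follow y = 5 + 2*x1 + 3*x2 exactly, so the contributions plus the constant reproduce the mean of the dependent.

```python
def level_importance(coefficients, columns):
    """Level importance of each independent: b coefficient times that variable's mean."""
    return [b * sum(col) / len(col) for b, col in zip(coefficients, columns)]


# Made-up data following y = 5 + 2*x1 + 3*x2.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [5 + 2 * a + 3 * b for a, b in zip(x1, x2)]

importance = level_importance([2, 3], [x1, x2])
print(importance)                       # contribution of x1 and of x2
print(sum(importance) + 5, sum(y) / len(y))  # both equal the mean of y
```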
The beta weights are the regression (b) coefficients for standardized data. Beta is the
average amount the dependent increases when the independent increases one
standard deviation and other independent variables are held constant. If an independent
variable has a beta weight of .5, this means that when other independents are held
constant, the dependent variable will increase by half a standard deviation (.5 also). The
ratio of the beta weights is the ratio of the estimated unique predictive importance of the
independents. Note that the betas will change if variables or interaction terms are added
or deleted from the equation. Reordering the variables without adding or deleting will not
affect the beta weights. That is, the beta weights help assess the unique importance of
the independent variables relative to the given model embodied in the regression
equation. Note that adding or subtracting variables from the model can cause the b and
39. beta weights to change markedly, possibly leading the researcher to conclude that an
independent variable initially perceived as unimportant is actually an important
variable. In SPSS, Analyze, Regression, Linear; click the Statistics button; make sure
Estimates is checked to get the beta coefficients (the default).
Note that the betas reflect the unique contribution of each independent variable. Joint
contributions contribute to R-square but are not attributed to any particular independent
variable. The result is that the betas may underestimate the importance of a variable
which makes strong joint contributions to explaining the dependent variable but which
does not make a strong unique contribution. Thus when reporting relative betas, one
must also report the correlation of the independent variable with the dependent variable
as well, to acknowledge if it has a strong correlation with the dependent variable.
Standardized means that for each datum the mean is subtracted and the result divided
by the standard deviation. The result is that all variables have a mean of 0 and a
standard deviation of 1. This enables comparison of variables of differing magnitudes
and dispersions. Only standardized b-coefficients (beta weights) can be compared to
judge relative predictive power of independent variables.
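Because a beta weight is the b coefficient of the standardized data, it can also be obtained from an unstandardized b by rescaling with the standard deviations: beta = b * (standard deviation of x) / (standard deviation of y). A minimal sketch with a hypothetical function name and made-up data:

```python
from statistics import stdev


def beta_weights(coefficients, columns, y):
    """Convert unstandardized b coefficients to beta weights: b * sd(x) / sd(y)."""
    sd_y = stdev(y)
    return [b * stdev(col) / sd_y for b, col in zip(coefficients, columns)]


# Made-up single-predictor case y = 2*x: the beta weight equals the correlation, i.e. 1.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(beta_weights([2.0], [x], y))
```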
Note some authors use "b" to refer to sample regression coefficients, and "beta" to refer
to regression coefficients for population data. They then refer to "standardized beta" for
what is simply called the "beta weight" here.
Correlation:
Pearson's r2 is the percent of variance in the dependent explained by the given
independent when (unlike the beta weights) all other independents are allowed to vary.
The result is that the magnitude of r2 reflects not only the unique covariance it shares
with the dependent, but uncontrolled effects on the dependent attributable to covariance
the given independent shares with other independents in the model. A rule of thumb is
that multicollinearity may be a problem if a correlation is > .90 or several are >.7 in the
correlation matrix formed by all the independents.
The intercept,
Variously expressed as e, c, or x-sub-0, is the estimated Y value when all the
independents have a value of 0. Sometimes this has real meaning and sometimes it
doesn’t — that is, sometimes the regression line cannot be extended beyond the range
of observations, either back toward the Y axis or forward toward infinity. In SPSS,
Analyze, Regression, Linear; click the Statistics button; make sure Estimates is checked
to get the intercept, labeled the "constant" (the default).
MS EXCEL allows the researcher to check a box to not have an intercept. This is
equivalent to forcing the regression line to run through the origin. In rare cases the
researcher may know the relation is linear and that the dependent variable is zero when
all the independents are zero, in which case the option may be selected.
R2, also called multiple correlation or the coefficient of multiple determination, is the
percent of the variance in the dependent explained uniquely or jointly by the
independents. R-squared can also be interpreted as the proportionate reduction in error
in estimating the dependent when knowing the independents. That is, R2 reflects the
number of errors made when using the regression model to guess the value of the
dependent, in ratio to the total errors made when using only the dependent's mean as
the basis for estimating all cases. Mathematically, R2 = (1 - (SSE/SST)), where SSE =
error sum of squares = SUM ((Yi - EstYi) squared), where Yi is the actual value of Y for
the ith case and EstYi is the regression prediction for the ith case; and where SST = total
sum of squares = SUM ((Yi - MeanY) squared). The "residual sum of squares" in SPSS
output is SSE and reflects regression error. Thus R-square is 1 minus regression error
40. as a percent of total error and will be 0 when regression error is as large as it would be
if you simply guessed the mean for all cases of Y. Put another way, the regression sum
of squares/total sum of squares = R-square, where the regression sum of squares =
total sum of squares - residual sum of squares. In SPSS, Analyze, Regression, Linear;
click the Statistics button; make sure Model fit is checked to get R2.
Maximizing R2 by adding variables is inappropriate unless variables are added to the
equation for sound theoretical reason. At an extreme, when n-1 variables are added to a
regression equation, R2 will be 1, but this result is meaningless. Adjusted R2 is used as
a conservative reduction to R2 to penalize for adding variables and is required when the
number of independent variables is high relative to the number of cases or when
comparing models with different numbers of independents.
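The definition R2 = 1 - SSE/SST and the adjustment for the number of independents can be sketched as follows. The function names are hypothetical, and the adjusted form shown, 1 - (1 - R2)(n - 1)/(n - k - 1) with n cases and k independents, is the usual textbook formula rather than one quoted in this report.

```python
def r_squared(actual, predicted):
    """R2 = 1 - SSE/SST, with SSE the error and SST the total sum of squares."""
    mean_y = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_y) ** 2 for a in actual)
    return 1 - sse / sst


def adjusted_r_squared(r2, n, k):
    """Penalize R2 for the number of independents k, given n cases."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


actual = [1.0, 2.0, 3.0, 4.0]
print(r_squared(actual, actual))       # perfect predictions
print(r_squared(actual, [2.5] * 4))    # guessing the mean for every case
print(adjusted_r_squared(0.5, n=11, k=2))
```

Guessing the mean for all cases gives an R2 of 0, which matches the interpretation of R2 as the proportionate reduction in error.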
Standard Error of Estimate (SEE), confidence intervals, and prediction intervals.
Confidence intervals around the mean are discussed in the section on significance. In
regression, however, the confidence refers to more than one thing. Note the confidence
and prediction intervals will improve (narrow) if sample size is increased, or the
confidence level is decreased (ex., from 95% to 90%).
For large samples, SEE approximates the standard error of a predicted value. SEE is
the standard deviation of the residuals. In a good model, SEE will be markedly less than
the standard deviation of the dependent variable. In a good model, the mean of the
dependent variable will be greater than 1.96 times SEE.
The confidence interval of the regression coefficient. Based on t-tests, the
confidence interval is the plus/minus range around the observed sample regression
coefficient, within which we can be, say, 95% confident the real regression coefficient
for the population regression lies. Confidence limits are relevant only to random sample
datasets. If the confidence interval includes 0, then there is no significant linear
relationship between x and y. We then do not reject the null hypothesis that x is
independent of y. In SPSS, Analyze, Regression, Linear; click Statistics; check
Confidence Limits to get t and confidence limits on b.
The confidence interval of y (the dependent variable) is also called the standard error
of mean prediction. Some 95 times out of a hundred, the true mean of y will be within
the confidence limits around the observed mean of n sampled cases. That is, the
confidence interval is the upper and lower bounds for the mean predicted response.
Note the confidence interval of y deals with the mean, not an individual case of y.
Moreover, the confidence interval is narrower than the prediction interval, which deals
with individual cases. Note a number of textbooks do not distinguish between
confidence and prediction intervals and confound this difference. In SPSS, select
Analyze, Regression, Linear; click Save; under "Prediction intervals" check "Mean" and
under "Confidence interval" set the confidence level you want (ex., 95%). Note SPSS
calls this a prediction interval for the mean.
The prediction interval of y. For the 95% confidence limits, the prediction interval on a
fitted value is the estimated value plus or minus 1.96 times SQRT (SEE^2 + S_y^2),
where S_y is the standard error of the mean prediction. Prediction intervals are
upper and lower bounds for the prediction of the dependent variable for a single case.
Thus some 95 times out of a hundred, a case with the given values on the independent
variables would lie within the computed prediction limits. The prediction interval will be
wider (less certain) than the confidence interval, since it deals with an interval estimate
of cases, not means. In SPSS, select Analyze, Regression, Linear; click Save; under
"Prediction intervals" check "Individual" and under "Confidence interval" set the
confidence level you want (ex., 95%).
41. F test: The F test is used to test the significance of R, which is the same as testing the
significance of R2, which is the same as testing the significance of the regression model
as a whole. If prob(F) < .05, then the model is considered significantly better than would
be expected by chance and we reject the null hypothesis of no linear relationship of y to
the independents. F is a function of R2, the number of independents and the number of
cases. F is computed with k and (n - k - 1) degrees of freedom, where k = number of
terms in the equation not counting the constant.
F = [R2/k]/[(1 - R2 )/(n - k - 1)].
In MS EXCEL, the F test appears in the ANOVA table, which is part of regression
output. Note that the F test is too lenient for the stepwise method of estimating
regression coefficients and an adjustment to F is recommended (
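The F formula above is simple enough to check by hand. The sketch below, with a hypothetical function name and made-up inputs, computes only the F value; obtaining prob(F) additionally requires the F distribution with k and (n - k - 1) degrees of freedom, from tables or a statistics package.

```python
def f_statistic(r2, n, k):
    """F = [R2/k] / [(1 - R2)/(n - k - 1)] for n cases and k independents."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# Made-up example: R2 = 0.5 with 13 cases and 2 independents.
print(f_statistic(0.5, n=13, k=2))
```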
Outliers are data points which lie outside the general linear pattern of which the midline
is the regression line. A rule of thumb is that outliers are points whose standardized
residual is greater than 3.3 (corresponding to the .001 alpha level). The removal of
outliers from the data set under analysis can at times dramatically affect the
performance of a regression model. Outliers should be removed if there is reason to
believe that other variables not in the model explain why the outlier cases are unusual --
that is, these cases need a separate model. Alternatively, outliers may suggest that
additional explanatory variables need to be brought into the model (that is, the model
needs respecification). Another alternative is to use robust regression, whose algorithm
gives less weight to outliers but does not discard them.
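The rule of thumb above can be sketched in Python. This is a minimal illustration under stated assumptions: the residuals are hypothetical, and each residual is standardized by dividing it by the standard error of estimate, which is one common convention (statistical packages may standardize slightly differently).

```python
import math

def flag_outliers(residuals, n_predictors, threshold=3.3):
    """Return the positions of cases whose standardized residual exceeds the threshold."""
    n = len(residuals)
    df = n - n_predictors - 1
    # Standard error of estimate: root of residual sum of squares over its d.f.
    s = math.sqrt(sum(r ** 2 for r in residuals) / df)
    return [i for i, r in enumerate(residuals) if abs(r / s) > threshold]

# Hypothetical residuals from a model with 2 independents and 30 cases
residuals = [0.5, -0.5] * 14 + [6.0, -6.0]
outlier_cases = flag_outliers(residuals, n_predictors=2)
```

With these numbers only the last two cases (positions 28 and 29) are flagged. The flagged cases are candidates for closer inspection, not for automatic deletion.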
Multicollinearity is the intercorrelation of independent variables. R2's near 1 violate the
assumption of no perfect collinearity, while high R2's increase the standard error of the
beta coefficients and make assessment of the unique role of each independent difficult
or impossible. While simple correlations tell something about multicollinearity, the
preferred method of assessing multicollinearity is to regress each independent on all the
other
Assumptions
Proper specification of the model: If relevant variables are omitted from the model,
the common variance they share with included variables may be wrongly attributed to
those variables, and the error term is inflated. If causally irrelevant variables are
included in the model, the common variance they share with included variables may be
wrongly attributed to the irrelevant variables. The more the correlation of the irrelevant
variable(s) with other independents, the greater the standard errors of the regression
coefficients for these independents. Omission and irrelevancy can both affect
substantially the size of the b and beta coefficients. This is one reason why it is better to
use regression to compare the relative fit of two models rather than to seek to establish
the validity of a single model.
Linearity. Regression analysis is a linear procedure. To the extent nonlinear
relationships are present, conventional regression analysis will underestimate the
relationship. That is, R-square will underestimate the variance explained overall and the
betas will underestimate the importance of the variables involved in the non-linear
relationship. Substantial violation of linearity thus means regression results may be
more or less unusable. Minor departures from linearity will not substantially affect the
interpretation of regression output. Checking that the linearity assumption is met is an
essential research task when use of regression models is contemplated.
Nonlinear transformations. When nonlinearity is present, it may be possible to remedy
the situation through use of exponential or interactive terms. Nonlinear transformation of
selected variables may be a pre-processing step, but beware that this runs the danger
of overfitting the model to what are, in fact, chance variations in the data. Power and
other transform terms should be added only if there is a theoretical reason to do so.
Adding such terms runs the risk of introducing multicollinearity in the model. A guard
against this is to use centering when introducing power terms (subtract the mean from
each score). Correlation and unstandardized b coefficients will not change as the result
of centering.
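The effect of centering can be checked numerically. The Python sketch below uses an illustrative variable (the values 1 to 10, an assumption made for the example) and compares the correlation between a variable and its square before and after centering.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

x = list(range(1, 11))                       # illustrative predictor
x_squared = [v ** 2 for v in x]              # power term on the raw scale

mean_x = sum(x) / len(x)
x_centered = [v - mean_x for v in x]         # subtract the mean from each score
x_centered_squared = [v ** 2 for v in x_centered]

r_raw = pearson_r(x, x_squared)                         # close to 1
r_centered = pearson_r(x_centered, x_centered_squared)  # essentially 0 here
```

The correlation drops to zero in this example only because the values are symmetric about their mean; with skewed data, centering reduces the correlation between a variable and its power term but does not remove it entirely.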
Partial regression plots are often used to assess nonlinearity. These are simply plots
of each independent on the x axis against the dependent on the y axis. Curvature in the
pattern of points in a partial regression plot shows if there is a nonlinear relationship
between the dependent and any one of the independents taken individually. Note,
however, that whereas partial regression plots are preferred for illuminating cases with
high leverage, partial residual plots (below) are preferred for illuminating nonlinearities.
Simple residual plots also show nonlinearity but do not distinguish monotone from
nonmonotone nonlinearity. These are usually plots of standardized residuals against
standardized estimates of Y, the dependent variable. The plot should show a random
pattern, with no nonlinearity or heteroscedasticity. In jargon, this will show the error
vector is orthogonal to the estimate vector. Non-linearity is, of course, shown when
points form a curve. Non-normality is shown when points are not equally above and
below the Y axis 0 line. Non-homoscedasticity is shown when points form a funnel or
other shape showing variance differs as one moves along the Y axis.
Non-recursivity. The dependent cannot also be a cause of one or more of the
independents. This is also called the assumption of non-simultaneity or absence of joint
dependence. Violation of this assumption causes regression estimates to be biased and
means significance tests will be unreliable.
No overfitting. The researcher adds variables to the equation while hoping that adding
each significantly increases R-squared. However, there is a temptation to add too many
variables just to increase R-squared by trivial amounts. Such overfitting trains the model
to fit noise in the data rather than true underlying relationships. Subsequent application
of the model to other data may well see substantial drops in R-squared.
Cross-validation is a strategy to avoid overfitting. Under cross-validation, a sample
(typically 60% to 80%) is taken for purposes of training the model, then the hold-out
sample (the other 20% to 40%) is used to test the stability of R-squared. This may be
done iteratively for each alternative model until stable results are achieved.
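A minimal Python sketch of the hold-out idea follows. It assumes a one-predictor model, a 70% training share and made-up data; all names are hypothetical. The model is fitted on the training cases only, and R-squared is then computed separately on the training and hold-out cases.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(xs, ys, intercept, slope):
    """Share of variance in ys explained by the given line."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_resid / ss_total

def holdout_check(xs, ys, train_percent=70, seed=0):
    """Fit on a random training sample, then score training and hold-out cases."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    cut = len(order) * train_percent // 100
    train, holdout = order[:cut], order[cut:]
    a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
    r2_train = r_squared([xs[i] for i in train], [ys[i] for i in train], a, b)
    r2_holdout = r_squared([xs[i] for i in holdout], [ys[i] for i in holdout], a, b)
    return r2_train, r2_holdout

# Illustrative data: a linear trend plus small fixed disturbances
noise = [0.8, -1.1, 0.3, -0.6, 1.2, -0.4, 0.9, -1.3, 0.2, 0.5] * 2
xs = list(range(1, 21))
ys = [3 + 2 * x + e for x, e in zip(xs, noise)]
r2_train, r2_holdout = holdout_check(xs, ys)
```

A hold-out R-squared close to the training R-squared suggests a stable model; a sharp drop is a sign of overfitting.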
Unbounded data are an assumption. That is, the regression line produced by OLS can
be extrapolated in both directions but is meaningful only within the upper and lower
natural bounds of the dependent.
Data are not censored, sample selected, or truncated. There are as many
observations of the independents as for the dependents. Collapsing an interval variable
into fewer categories leads to attenuation and will reduce R².
Absence of perfect multicollinearity. When there is perfect multicollinearity, there is
no unique regression solution. Perfect multicollinearity occurs if independents are linear
functions of each other (ex., age and year of birth), when the researcher creates dummy
variables for all values of a categorical variable rather than leaving one out, and when
there are fewer observations than variables.
Absence of high partial multicollinearity. When there is high but imperfect
multicollinearity, a solution is still possible but as the independents increase in
correlation with each other, the standard errors of the regression coefficients will
become inflated. High multicollinearity does not bias the estimates of the coefficients,
only their reliability. This means that it becomes difficult to assess the relative
importance of the independent variables using beta weights. It also means that a small
number of discordant cases potentially can affect results strongly. The importance of
this assumption depends on the type of multicollinearity. In the discussion below, the
term "independents" refers to variables on the right-hand side of the regression
equation other than control variables.
Normally distributed residual error: Error, represented by the residuals, should be
normally distributed for each set of values of the independents. A histogram of
standardized residuals should show a roughly normal curve. An alternative for the same
purpose is the normal probability plot, with the observed cumulative probabilities of
occurrence of the standardized residuals on the Y axis and of expected normal
probabilities of occurrence on the X axis, such that a 45-degree line will appear when
observed conforms to normally expected. The F test is relatively robust in the face of
small to medium violations of the normality assumption. The central limit theorem
implies that even when error is not normally distributed, when sample size is large,
the sampling distribution of the b coefficient will still be normal. Therefore violations of
this assumption usually have little or no impact on substantive conclusions for large
samples, but when sample size is small, tests of normality are important.
Additivity. Likewise, regression does not account for interaction effects, although
interaction terms (usually products of standardized independents) may be created as
additional variables in the analysis. As in the case of adding nonlinear transforms,
adding interaction terms runs the danger of overfitting the model to what are, in fact,
chance variations in the data. Such terms should be added only when there are
theoretical reasons for doing so. That is, significant but small interaction effects from
interaction terms not added on a theoretical basis may be artifacts of overfitting. Such
artifacts are unlikely to be replicable on other datasets.
Homoscedasticity: The researcher should test to assure that the residuals are
dispersed randomly throughout the range of the estimated dependent. Put another way,
the variance of residual error should be constant for all values of the independent(s). If
not, separate models may be required for the different ranges. Also, when the
homoscedasticity assumption is violated "conventionally computed confidence intervals
and conventional t-tests for OLS estimators can no longer be justified" (Berry, 1993: 81).
However, moderate violations of homoscedasticity have only minor impact on
regression estimates (Fox, 2005: 516).
No outliers. Outliers are a form of violation of homoscedasticity. Detected in the
analysis of residuals and leverage statistics, these are cases representing high
residuals (errors) which are clear exceptions to the regression explanation. Outliers can
affect regression coefficients substantially. The set of outliers may suggest/require a
separate explanation. Some computer programs allow an option of listing outliers
directly, or there may be a "case wise plot" option which shows cases more than 2 s.d.
from the estimate. To deal with outliers, the researcher may remove them from analysis
and seek to explain them on a separate basis, or transforms may be used which tend to
"pull in" outliers. These include the square root, logarithmic, and inverse (x = 1/x)
transforms.
Reliability: Reliability is reduced by measurement error and, since all variables have
some measurement error, by having a large number of independent variables. To the
extent there is random error in measurement of the variables, the regression
coefficients will be attenuated. To the extent there is systematic error in the
measurement of the variables, the regression coefficients will be simply wrong. (In
contrast to OLS regression, structural equation modeling involves explicit modeling of
measurement error, resulting in coefficients which, unlike regression coefficients, are
unbiased by measurement error.) Note measurement error terms are not to be confused
with residual error of estimate, discussed below.
Population error is uncorrelated with each of the independents.
This is the "assumption of mean independence": that the mean error is independent of
the x independent variables. This is a critical regression assumption which, when
violated, may lead to substantive misinterpretation of output.
The (population) error term, which is the difference between the actual values of the
dependent and those estimated by the population regression equation, should be
uncorrelated with each of the independent variables. Since the population regression
line is not known for sample data, the assumption must be assessed by theory.
Specifically, one must be confident that the dependent is not also a cause of one or
more of the independents, and that the variables not included in the equation are not
causes of Y and correlated with the variables which are included. Either circumstance
would violate the assumption of uncorrelated error. One common type of correlated
error occurs due to selection bias with regard to membership in the independent
variable "group" (representing membership in a treatment vs. a comparison group):
measured factors such as gender, race, education, etc., may cause differential selection
into the two groups and also can be correlated with the dependent variable. When there
is correlated error, conventional computation of standard deviations, t-tests, and
significance are biased and cannot be used validly. Note that residual error -- the
difference between observed values and those estimated by the sample regression
equation -- will always be uncorrelated and therefore the lack of correlation of the
residuals with the independents is not a valid test of this assumption.
Independent observations (absence of autocorrelation) leading to uncorrelated
error terms. Current values should not be correlated with previous values in a data
series. This is often a problem with time series data, where many variables tend to
increment over time such that knowing the value of the current observation helps one
estimate the value of the previous observation. Spatial autocorrelation can also be a
problem when units of analysis are geographic units and knowing the value for a given
area helps one estimate the value of the adjacent area. That is, each observation
should be independent of each other observation if the error terms are not to be
correlated, which would in turn lead to biased estimates of standard deviations and
significance.
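One simple way to look for this problem in a time series is the lag-1 autocorrelation of the residuals, that is, the correlation of each residual with the one before it. The Python sketch below is an illustration only; both residual series are made up for the example.

```python
def lag1_autocorrelation(values):
    """Correlation of each value with the previous value in the series."""
    n = len(values)
    mean = sum(values) / n
    numerator = sum((values[t] - mean) * (values[t - 1] - mean) for t in range(1, n))
    denominator = sum((v - mean) ** 2 for v in values)
    return numerator / denominator

# Hypothetical residuals in time order
drifting = [-3, -2, -2, -1, -1, 1, 1, 2, 2, 3]      # errors move slowly over time
patternless = [2, 1, -2, -1, 1, 2, -2, -1, 0, 0]    # no obvious carry-over

r_drifting = lag1_autocorrelation(drifting)         # about 0.66
r_patternless = lag1_autocorrelation(patternless)   # about 0.05
```

A value well away from zero, as in the first series, indicates that successive errors are not independent and that the usual significance tests should be treated with caution.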
After the assumptions and technicalities of the multiple regression model had been
reviewed, it was unanimously decided that multiple regression should be used, because
demand for pharma products is affected by several parameters, each to a greater or
lesser degree. Work therefore began on constructing a multiple regression model for
demand forecasting.
The first step was to select suitable software for applying the multiple regression model
to the product basket of 500+ products.
MS Excel was found to offer multiple regression with a certain number of parameters.
Let us first see how to use the Multiple Regression function in MS Excel.
Multiple Regression with MS Excel
To do regression in Excel, you need the Analysis ToolPak add-in to be installed in
Excel. This was an option when you installed Excel, but you might not have selected it.
If you didn't install it, Excel will ask you for the CD when you try to add the ToolPak.
Check that the add-in is installed, and added in, by choosing Add-ins from the Tools
menu (as shown below).
Then ensure that "Analysis ToolPak" is selected, as shown below.
You can now use the data analysis functions in Excel, which include multiple
regression.
The example that we will work through is taken from dataset 6.1b in the book "Applying
regression and correlation" (if you jumped straight in here, that is what these web pages
are about).
To get to the data analysis function in Excel, you select the Tools menu, and then
choose Data Analysis.
This gives the following dialog; click on Regression and then click OK.
The following dialog appears:
In here, we tell Excel about the data that we would like to analyze.
The first box is the input Y range. Here, we tell Excel about our dependent variable.
The dependent variable must be a column, 1 cell wide and N cells long (where N is the
number of individuals that we are analyzing).
In the dataset we are using, the dependent variable is An, which is the column that
goes from cell D1 to cell D41. You can either type this information in directly as
D1:D41, or you can select the appropriate data from the spreadsheet.
Because we have included row 1, which includes the variable name, we are going to
have to tell Excel this, by clicking on the "Labels" checkbox.
The next stage is to input the independent variables. The independent variables must
be a block of data, of k columns (where k is the number of independent variables) and N
rows (where N is still the number of people). In the dataset we are using we have three
independent variables: hassles, hassles2 and hassles3. (These represent the linear,
quadratic and cubic effects of hassles - we are analyzing a non-linear relationship here.)
These are held in rows 1 - 41 of columns A, B and C. Again, we can type in A1:C41 or
select the data from the spreadsheet - it will have the same effect.
Next we tell Excel where we want the results to be written. It is best to ask for a new
sheet - you don't want to accidentally overwrite some of your precious data, and have to
go to all of the effort of restoring it from a backup, do you? (You do have a backup,
don't you?)
We can ask for residuals and standardized residuals to be saved - these will be new
columns of numbers created in the new spreadsheet.
Two types of graphs will be drawn automatically if you ask for them.
· A residual plot will draw scatter plots of each independent variable on the x-axis,
and the residual on the y-axis.
· A line fit plot will draw scatter plots of each independent variable on the x-axis,
and the predicted and actual values of the dependent variable on the y axis.
You cannot, as far as I have been able to determine, automatically have a scatter plot
with the predicted values on the x-axis and the residuals on the y-axis (although you
can calculate these values and save them).
You can also request a normal probability plot. This appears to be a plot of the
dependent variable, which is a curious thing to plot - regression analysis does not
assume normal distribution of the dependent variable. The usual plot of this type would
be the residuals, but this is not possible in Excel.
The dialog box now looks like this:
So, finally, we click OK.
And we get a lot of output, written to a new sheet. A note about this output - output from
analysis in Excel is usually "live" that is to say, the data are linked to the output. If you
change the data, you will change the output. This is not the case for this type of output
in Excel. The results of the analysis are "dead" and will not change.
Regression Statistics
The first part of the output is the regression statistics. These are standard statistics
which are given by most programs.
ANOVA
The ANOVA table comes next. This gives a test of significance of the R². Note that
Excel uses scientific notation by default, so when it says 2.22E-08 it means 2.22 × 10^-8
(i.e., 0.0000000222).
The summary output given by the regression function in MS Excel is shown below.
Summary Output

Regression Statistics
Multiple R
R Square
Adjusted R Square
Standard Error
Observations

ANOVA
             df    SS    MS    F    Significance F
Regression
Residual
Total

             Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%    Lower 95.0%    Upper 95.0%
Intercept
X1
X2
X3
X4
X5
X6
X7
…
Coefficients
The next stage is the coefficients. (Note that here I have converted the numbers to 2
decimal places to save space.) It gives the coefficient for each parameter, including
the intercept (the constant). The standard errors, and the t-values follow (the t-value is
the coefficient divided by the standard error). Next comes the p-value associated with
the variable, and the confidence intervals of the parameter estimates (Excel gave these
to me twice, even though I didn't ask for them.)
Residuals
The final part of the output is the residual information. The observation in the left-hand
column is the case number - although Excel never told us about this, it has labeled the
first person Observation 1, the second Observation 2, etc. (Note that this is NOT the
original row number - Observation 1 was row 2).
The predicted anxiety score is the score that was predicted from the regression
equation. The residual is the raw residual - that is the difference between the predicted
score and the actual score on the dependent variable. The final value is the
standardized residual (the residuals adjusted to ensure that they have a standard
deviation of 1; they have a mean of zero already).
Graphs
Finally, we will have a quick look at the graphs.
The first graph is an example of the residual plots - it has hassles on the x-axis and the
unstandardized residual on the y-axis.
The second graph shows the predicted and actual anxiety scores plotted against hassles3.
As described above, MS Excel makes it possible to apply the Multiple Regression
function.
Limitation of Regression function
The Regression function writes its results to a sheet that does not change when the
data change. This is known as a dead sheet, and it does not meet our criteria.
A dynamic function is needed, one whose output changes as the data change.
A validation list is used to switch the data from one SKU to another.
With the Regression function, the output does not change with the SKU, so it is not
possible to create summary output for the entire product basket.
Hence another function, called LINEST (Linear Estimation), is used to get output that
changes with the data.
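The requirement described here, a fit that is recomputed whenever the SKU or its data change, can be illustrated outside Excel as well. The Python sketch below is only an analogy to the LINEST approach, not the method used in the project; the SKU names and figures are hypothetical, and a single independent variable is assumed for brevity.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_all_skus(history):
    """Refit every SKU from its current data; call again after any data change."""
    return {sku: fit_line(xs, ys) for sku, (xs, ys) in history.items()}

# Hypothetical history: SKU -> (independent values, actual sales)
history = {
    "SKU_A": ([1, 2, 3, 4], [10, 12, 14, 16]),
    "SKU_B": ([1, 2, 3, 4], [5, 5, 6, 8]),
}
first_fit = fit_all_skus(history)

history["SKU_A"][1][3] = 20          # the latest sales figure is revised
second_fit = fit_all_skus(history)   # the output follows the data
```

Because the fit is a function of the data rather than a one-off report, one call covers the whole product basket and the result for any SKU is always current.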
LINEST
Calculates the statistics for a line by using the "least squares" method to calculate a
straight line that best fits your data, and returns an array that describes the line.
Because this function returns an array of values, it must be entered as an array formula.
The equation for the line is:
y = mx + b or
y = m1x1 + m2x2 + ... + b (if there are multiple ranges of x-values)
Where the dependent y-value is a function of the independent x-values. The m-values
are coefficients corresponding to each x-value, and b is a constant value. Note that y, x,
and m can be vectors. The array that LINEST returns is {mn,mn-1,...,m1,b}. LINEST can
also return additional regression statistics.
Syntax
LINEST(known_y's,known_x's,const,stats)
Known_y's is the set of y-values you already know in the relationship y = mx + b.
If the array known_y's is in a single column, then each column of known_x's is
interpreted as a separate variable.
If the array known_y's is in a single row, then each row of known_x's is interpreted as a
separate variable.
Known_x's is an optional set of x-values that you may already know in the relationship
y = mx + b.
The array known_x's can include one or more sets of variables. If only one variable is
used, known_y's and known_x's can be ranges of any shape, as long as they have
equal dimensions. If more than one variable is used, known_y's must be a vector (that
is, a range with a height of one row or a width of one column).
If known_x's is omitted, it is assumed to be the array {1, 2,3,...} that is the same size as
known_y's.
Const is a logical value specifying whether to force the constant b to equal 0.
If const is TRUE or omitted, b is calculated normally.
If const is FALSE, b is set equal to 0 and the m-values are adjusted to fit y = mx.
Stats is a logical value specifying whether to return additional regression statistics.
If stats is TRUE, LINEST returns the additional regression statistics, so the returned
array is {mn,mn-1,...,m1,b;sen,sen-1,...,se1,seb;r2,sey;F,df;ssreg,ssresid}.
If stats is FALSE or omitted, LINEST returns only the m-coefficients and the constant b.
The additional regression statistics are as follows.
Statistic: Description
se1,se2,...,sen: The standard error values for the coefficients m1,m2,...,mn.
seb: The standard error value for the constant b (seb = #N/A when const is FALSE).
r2: The coefficient of determination. Compares estimated and actual y-values, and ranges in value from 0 to 1. If it is 1, there is a perfect correlation in the sample - there is no difference between the estimated y-value and the actual y-value. At the other extreme, if the coefficient of determination is 0, the regression equation is not helpful in predicting a y-value. For information about how r2 is calculated, see "Remarks" later in this topic.
sey: The standard error for the y estimate.
F: The F statistic or the F-observed value. Use the F statistic to determine whether the observed relationship between the dependent and independent variables occurs by chance.
df: The degrees of freedom. Use the degrees of freedom to help you find F-critical values in a statistical table. Compare the values you find in the table to the F statistic returned by LINEST to determine a confidence level for the model. For information about how df is calculated, see "Remarks" later in this topic. Example 4 below shows use of F and df.
ssreg: The regression sum of squares.
ssresid: The residual sum of squares. For information about how ssreg and ssresid are calculated, see "Remarks" later in this topic.
The following illustration shows the order in which the additional regression statistics are
returned.
Statistics given by function
coeff(n) coeff(n-1) coeff(n-2) coeff(n-3) ……
se(n) se(n-1) se(n-2) se(n-3) ……
coeff of det S.E
F stats d.f.
SS reg SS resid
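To make the returned statistics concrete, here is a minimal Python sketch that computes the same quantities by ordinary least squares. It is an illustration, not Excel's implementation: the function names, the dictionary layout and the data are assumptions made for the example. Coefficients are returned in LINEST's order (last x-variable first, constant last), and when const is False the total sum of squares is taken about zero rather than about the mean.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: perfect multicollinearity?")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def linest_like(known_y, known_x_rows, const=True):
    """known_x_rows holds one list of x-values per observation."""
    design = [list(row) + ([1.0] if const else []) for row in known_x_rows]
    n, p = len(design), len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, known_y)) for i in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in design]
    ss_resid = sum((y - f) ** 2 for y, f in zip(known_y, fitted))
    centre = sum(known_y) / n if const else 0.0
    ss_total = sum((y - centre) ** 2 for y in known_y)
    ss_reg = ss_total - ss_resid
    df = n - p                                  # n - k - 1 with a constant, n - k without
    se_y = (ss_resid / df) ** 0.5
    se = [se_y * inv[i][i] ** 0.5 for i in range(p)]
    k = p - 1 if const else p                   # number of x-variables
    return {
        "coefficients": beta[:k][::-1] + [beta[k] if const else 0.0],
        "standard_errors": se[:k][::-1] + [se[k] if const else None],
        "r2": ss_reg / ss_total,
        "se_y": se_y,
        "F": (ss_reg / k) / (ss_resid / df),
        "df": df,
        "ss_reg": ss_reg,
        "ss_resid": ss_resid,
    }

# Illustrative data built as y = 1 + 2*x1 + 3*x2 plus small disturbances
x_rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
y = [9.1, 7.9, 19.1, 17.9, 28.8, 28.2]
result = linest_like(y, x_rows)
# result["coefficients"] is approximately [3, 2, 1], i.e. {m2, m1, b}
```

The disturbances in this example were chosen to be uncorrelated with the x-variables, so the fit recovers the coefficients used to build the data; real data will not behave so neatly.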
Fitting Multiple Regression Model
At the SCM department, DPC handles a vast and scattered product basket. The product
basket contains various drugs in the form of tablets, capsules, vials and bottles. Many
drugs are combinations of different molecules, and the products belong to different
molecule classes. A number of parameters that can affect actual sales have been
identified in the discussion so far, and each parameter has to be checked for its impact
on actual sales.
The question is which parameters to include in the model as independent variables.
The significance and validity of each parameter should be checked first; only then
should a decision be taken as to which parameters to include.
ASSUMPTIONS made
The parameters taken into consideration are minimally correlated with one another.
The multiple regression model follows all the assumptions of correlation.
The data collected are accurate.
Future estimates of the parameters are true.
No intercept is considered.
Data Sources
SAP data files - SAP data files are the files extracted from SAP.
SAP contains all the data regarding sales, orders, availability, field targets,
institutional sales and more, and it holds past data in every form in which it is
needed, since these data were entered into SAP in the past. SAP is therefore used to
obtain the data, and the extracted files serve as the internal source of data.
ORG-MARG DATA
ORG-MARG is a market research company. It collects sales data from retail
counters. The data collected by ORG are product specific, company specific, industry
specific and market specific.
The data used for the project are from the pharmaceuticals sector.
TORRENT is a subscriber to the ORG data and uses it for market research and
analysis.
There is a separate cell at TORRENT that deals with the ORG data. ORG-MARG
refreshes the data every month with the figures for the most recent past month.
ORG-MARG has dedicated software, which is used to get the data in the form in which
they are needed.
ORG data are available on a market basis, organised in the hierarchy shown below.
ORG data are available on a monthly as well as a yearly basis.
ORG data are available in units (strips) and in value. ORG also gives the company
market share, company market growth, molecule growth, molecule class growth, and the
company's share in the particular sector. It also provides year-wise statistics: how much
market share the company has, and how much it has gained or lost.
ORG provides the data a month later; i.e., in the month of June it provides the data for
the month of May.
Market: Pharmaceuticals
Molecule class: Tranquilizers
Molecule: Alprazolam
Pack wise: Alprax .5 tab, Alprax Sr 0.5 Tab
Strength wise: Alprax (0.5) (all the products of 0.5 strength)
Brand: Alprax