proposal

UNIVERSITY OF RWANDA
COLLEGE OF SCIENCE AND TECHNOLOGY
SCHOOL OF SCIENCES
DEPARTMENT OF MATHEMATICS
STATISTICS OPTION
Final year student project
Title: Statistical Modeling and Forecasting of rainfall data in Rwanda
Name: Felix MUCYO
Under the guidance of:
May 17, 2016

Certification
This is to certify that the study entitled "Statistical Modeling and Forecasting of Rainfall
Data in Rwanda" is a record of original work done by MUCYO Felix (GS20130468) in partial
fulfillment of the requirements for the award of a Bachelor's degree with Honours in
Mathematics (Statistics) at the University of Rwanda, College of Science and Technology
(CST), during the academic year 2015-2016.

Acknowledgement
This research was completed thanks to the support and participation of several people, whom
I would like to thank. I would like to express my heartfelt gratitude to Jean Paul
NSABIMANA, who agreed to supervise my study despite his many duties. His guidance, remarks
and relevant suggestions were of great help in framing this study. My sincere gratitude goes
to the RWANDA METEOROLOGY AGENCY (METEO RWANDA), which made available the monthly
rainfall data of the KIGALI-AERO station from 1971 to 2014, with records of differing
lengths and completeness. Their tremendous support in career building and hands-on skills
made this study fruitful.

Abstract
Rainfall is a highly variable parameter that varies on scales from a few meters to several
kilometers. The importance of accurate rainfall observation and forecasting is widely
recognized. In this research study, entitled "Statistical Modeling and Forecasting of
Rainfall Data in Kigali City", rainfall data from the KIGALI-AERO station will be modeled by
time series methods and a generalized linear model in order to build an adequate model that
can effectively be used to predict future events based on previous observations.

Contents
Certification
Acknowledgement
Abstract
1 GENERAL INTRODUCTION
1.1 Introduction
1.2 Background
1.3 Problem statement
1.4 Purpose of the Research
1.5 Objectives of the Study
1.5.1 Main Objective
1.5.2 Specific Objectives
1.6 Research Questions
1.7 Hypothesis of the Study
2 LITERATURE REVIEW
2.1 Overview
2.1.1 Forecasting
2.2 Basic Concepts in Analysis of Rainfall Time Series Data
2.3 Review of Rainfall Measurement and Forecasting Methods
2.4 Generalized linear model for Rainfall variability
2.5 Gamma Distribution
3 METHODOLOGY
3.1 Data Source and Sample size
3.2 Time Series Method for Data Analysis
3.2.1 Box-Jenkins Algorithm
3.3 Time Series Model Type
3.3.1 Autoregressive Moving Average ARMA (p, q) model
3.4 Generalized linear model for Rainfall
4 RESULTS AND ANALYSIS
4.1 Data presentation
4.1.1 Preliminary Analysis
4.2 Fitting the appropriate time series model of yearly average data
4.2.1 Identifying potential model
4.2.2 Model fitting
4.2.3 Model checking
4.3 RESULTS AND INTERPRETATIONS
5 DISCUSSION, CONCLUSION AND RECOMMENDATION
5.1 DISCUSSION
5.2 CONCLUSION
5.3 RECOMMENDATION
REFERENCES

Chapter 1
GENERAL INTRODUCTION
1.1 Introduction
In recent years, rainfall data analysis has become a main focus for many scientists,
bureaucrats, professionals, civil society organizations, NGOs, etc. Extreme rainfall and
rainfall variability are among the main causes of natural disasters worldwide, especially
flood hazards. Not surprisingly, considerable attention has been paid in recent years to the
modeling and forecasting of extreme rainfall to help prevent flood hazards, and to support
the analysis of water-related structures, agriculture, and the monitoring of climate change.
Interactions between the various components of the climate system, such as the oceans, land
and atmosphere, have brought about climate change. This is characterized by rainfall
variability, which brings negative impacts to countries' economies. This has necessitated
efforts to understand the coherent multi-decadal fluctuations in the global climate and to
make predictions of rainfall extremes.
The generation of climatological time series is an essential task in order to describe the
behavior of the data, obtain estimates and predictions given the observed values, or simulate
series with the same statistical pattern. Precipitation time-series generation is used for
different purposes: to assess the impact of rainfall variability on water resource systems,
for hydrological modeling, and for water management issues.
Rainfall series are usually (though not always) characterized by large variability, so there
is a need to find suitable models that correctly capture the behavior of the data.
Precipitation amounts are usually estimated on the assumption that they follow a certain
theoretical probability distribution. Several such distributions have been used in weather
generators, such as the one-parameter exponential distribution, the two-parameter gamma and
Weibull distributions, and the three-parameter skewed normal and mixed exponential
distributions.
1.2 Background
Rwanda is a small, mountainous country with relatively high rainfall. It is situated in
Central Africa, bordered by Burundi in the South, the Democratic Republic of Congo in the
West, Uganda in the North and Tanzania in the East. The total area of Rwanda is about
26,338 km², of which about 24,948 km² is land and 1,390 km² (5.3%) is covered by water. In
2007 the population of Rwanda was estimated to be 9.3 million, which gives an estimated
population density of about 342 persons per km², the highest in Africa [1].
Rwanda experiences a bimodal pattern of rainfall, which is driven primarily by the
progression of the Inter-Tropical Convergence Zone (ITCZ). The ITCZ follows the annual
progression of the sun as it moves towards the northern summer solstice around June 21 and
the southern summer solstice around December 21 each year. The 'long rains' occur over March,
April and May (MAM) and the 'short rains' occur in September, October, November and
December (SOND).
The mean annual rainfall in Rwanda is about 1120 mm and varies from 700 mm/year in the
North-West to about 1600 mm/year in the South-West. The mean altitude is 1250 m above sea
level, with a general slope oriented from west to east. The altitude increases progressively
from the south-eastern plateau to the north and west, where it reaches its highest values in
the Congo-Nile Crest, with elevations varying between 2200 m and 3000 m, and in the chain of
volcanoes, with the highest point of 4507 m at the Karisimbi volcano [1].
Rwanda has heavy annual rainfall, which has always been relied on for seasonal agriculture
(Rwanda, 2011). On the other hand, seasonal agriculture is vulnerable to climate change,
because even slight changes in precipitation can significantly perturb crop and livestock
production. Munyaneza et al. (2009) [1] used daily rainfall records to present an overview of
the spatial and temporal variability of temperature, precipitation and streamflow data in
Rwanda.
Rwanda has a moderate climate with an annual average temperature of 19 °C. It is divided
into three agro-climatic zones, namely: i) the high-altitude region, ii) the central plateau,
and iii) the plateau of the eastern lowlands and the west. The country has an annual cycle of
four seasons that are distributed as follows:
• A short rainy season, locally known as Umuhindo, runs from September to November,
with November characterized by heavy precipitation;
• A short dry season, locally known as Urugaryi, runs from December to February;
• A long rainy season, locally known as Itumba, runs from March to May and brings
about 14 to 61% of the total annual precipitation;
• A long dry season, locally known as Icyi, runs from June to August.
In Eastern Africa, Schreck and H. M. Semazzi (2003) used rainfall records to examine the
recent variability of the Eastern African climate. The analysis of rainfall data derived from
a combination of rain gauge observations showed that the most dominant mode of variability
of the Eastern African climate corresponds to El Niño Southern Oscillation (ENSO) climate
variability. This study was based on the construction of empirical orthogonal functions of
gauge rainfall data drawn from the Greater Horn of Africa (GHA) countries, which are
Burundi, Tanzania, Rwanda, Uganda, Kenya, Ethiopia, Djibouti, Eritrea, Sudan and
Somalia [2].
1.3 Problem statement
Weather and rainfall forecasting is one of the most important components of water resource
management, supporting decision making and strategic planning, especially in the
agricultural sector. The ability to predict rainfall accurately and quantitatively can help
with crop planting decisions, reservoir water resource allocation, traffic control, the
operation of sewer systems, and confronting water-related problems such as flooding hazards
and drought, especially in countries such as Rwanda where agriculture contributes much to
the wealth and economy of the country. Therefore, an accurate forecast of rainfall will help
in natural disaster mitigation. Nonetheless, rainfall is one of the most complex and
challenging components of the hydrological cycle to comprehend and to forecast, due to the
various dynamic environmental factors and random variations both in space and in time.

1.4 Purpose of the Research
The purpose of this research is to assist the Rwanda Meteorology Agency (METEO RWANDA)
by contributing a sound analysis and simulation of rainfall in order to predict the future
state of the weather, especially rainfall. By adopting and using mathematical models at a
given location, activities can be planned so that they are not interrupted by weather
conditions.
1.5 Objectives of the Study
1.5.1 Main Objective
The main objective of this study is to build a suitable model for rainfall data, in order to
obtain sound forecasts of future values based on past observations and to support the
optimal control of events.
1.5.2 Specific Objectives
The specific objectives of this study are:
• To build a suitable model of rainfall data from the KIGALI-AERO station.
• To identify, fit and check the selected model.
• To assess the adequacy of the model.
• To determine whether there has been a change in the rainfall pattern in the area of study
over the 30-year period.
• To predict the average expected monthly and yearly rainfall in Kigali City based on past
observations from the KIGALI-AERO station.
• To ascertain the analogy between time series models and generalized linear models for
rainfall data.
1.6 Research Questions
• How can global climate change in rainfall variability be monitored, analyzed and advised
on through the application of statistical approaches?
• How can the goodness of fit of a given model be established using time series techniques,
so that the model can be relied on to predict weather and climate change based on past
observations?
• How can high mobility be supported by providing routine forecasts and information
services that facilitate agriculture and the establishment of infrastructure with the help
of rainfall data modeling?
• How can the safety of life and property be improved through better application of
statistical approaches (time series techniques) to weather, water and climate warnings and
to the forecasting of rainfall?
1.7 Hypothesis of the Study
Since the predictability of rainfall in a given area is uncertain, statistical approaches
are needed to deliver sound forecasts and integrate them into planning and decision-making
processes, so as to prevent the casualties and damage that natural disasters may cause [3].
• If statistical modeling of previous rainfall data can produce a best-fitting model, then
this model can be used to predict what the future is likely to be.
• If rainfall data can be modeled to yield a best-fitting model for forecasting future
values, then statistical modeling can be regarded as one mathematical way to forecast the
rainfall of a given area.
• If methods for predicting extreme rainfall have often been based on studies of the
physical effects of rainfall or on statistical studies of rainfall time series, then past
observations can show whether there are annual trends, seasonality, cycles and other time
series patterns.

Chapter 2
LITERATURE REVIEW
2.1 Overview
This section reviews previous research on the topic. It reflects the studies carried out by
several researchers and authors, and covers the relevant literature put forward in books,
reports and other published documents that were available for this study. Many researchers
have aimed to exploit statistical approaches and the modeling of rainfall data to forecast
what future events are likely to be, based on previous observations of rainfall data.
There have been many attempts to forecast rainfall. Some authors used the Markov chain
modeling approach for the synthetic generation of rainfall data. Thomas and Fiering (2001)
used a first-order Markov chain model to generate streamflow data. Srikanthan and Mohan
(2000) used and recommended a first-order Markov chain model to generate annual rainfall
data. However, few studies have been done on the synthetic generation of rainfall flow data
using the ARMA approach. The ARMA approach is generally used for the modeling and
simulation of rainfall flow data. Min et al. (2010) used an Autoregressive Integrated Moving
Average (ARIMA) model with an intervention model (also known as intervention analysis) to
evaluate the impact of different local, regional and global incidents of a man-made, natural
and health character on air movements in Taiwan over the last decade.
Rainfall forecasting can apply to many time horizons, such as short-term, medium-term and
long-term periods. Some authors design systems that forecast yearly data, some try to
forecast monthly data, and others try to forecast daily data. Most of them concentrate on
one-step-ahead prediction. If multi-step prediction is required, many iterations of
one-step-ahead prediction can be performed. The accuracy of the forecasts will of course
decrease with the number of such iterations. The traditional techniques for statistical
weather forecasting include ARMA models, Box-Jenkins models and Multivariate Adaptive
Regression Splines [4].
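The iterated one-step-ahead scheme described above can be sketched as follows. This is only
a minimal illustration that assumes a first-order autoregressive, AR(1), model fitted by
least squares; the function names `fit_ar1` and `iterate_forecast` and the rainfall values
are hypothetical and are not taken from the KIGALI-AERO data.

```python
from statistics import mean


def fit_ar1(series):
    """Estimate the mean and lag-1 coefficient of an AR(1) model by least squares."""
    mu = mean(series)
    dev = [y - mu for y in series]
    num = sum(a * b for a, b in zip(dev[1:], dev[:-1]))
    den = sum(d * d for d in dev[:-1])
    return mu, num / den


def iterate_forecast(series, steps):
    """Build a multi-step forecast by feeding each one-step forecast back in."""
    mu, phi = fit_ar1(series)
    last = series[-1]
    forecasts = []
    for _ in range(steps):
        last = mu + phi * (last - mu)  # one-step-ahead prediction
        forecasts.append(last)
    return forecasts


# Hypothetical monthly rainfall totals (mm), for illustration only.
rain = [95.0, 110.0, 130.0, 160.0, 120.0, 40.0, 15.0, 30.0, 70.0, 100.0, 125.0, 90.0]
print(iterate_forecast(rain, 3))
```

When the estimated coefficient lies between -1 and 1, each further iteration pulls the
forecast closer to the series mean, which is one way to see why the information carried by
the last observation fades as the forecast horizon grows.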
In comparison to other countries in Sub-Saharan Africa, few researchers have been interested
in studying the historical rainfall data of Rwanda. The Government of Rwanda notes that
Rwanda has heavy annual rainfall, which has always been relied on for seasonal agriculture
(Rwanda, 2011). On the other hand, seasonal agriculture is vulnerable to climate change,
because even slight changes in precipitation can significantly perturb crop and livestock
production. Munyaneza et al. (2009) used daily rainfall records to present an overview of
the spatial and temporal variability of temperature, precipitation and streamflow data in
Rwanda. They showed that daily rainfall and streamflow vary significantly within a specific
year (1985) and between years. It is important to mention that this study was carried out on
daily rainfall data from different stations over the period 1910-2008.
The framework of any study rests on planning how to obtain useful information on the key
quality characteristics of the process being studied. The choice of data collection method
can be influenced by the data collection strategy, the type of variable, the accuracy
required, the collection point and the skills of the enumerator.
Time series data correspond to a sequence of values of a single variable. Each case in the
data represents an observation at a different time, and the observations must be taken at
equally spaced time intervals.
2.1.1 Forecasting
One of the main objectives in investigating a time series is forecasting. This can be done
through the simplest model that adequately describes the behavior of the observed variable
and produces the required forecast. Forecasting is a planning tool that helps management in
its attempts to cope with the uncertainty of the future, relying mainly on data from the
past and present and on the analysis of trends.
Forecasting is designed to help decision making and planning in the present for the future.
Forecasts empower people because their use implies that we can modify variables now to alter
(or be prepared for) the future. These estimates are projected into the coming months or
years using one or more techniques such as Box-Jenkins models, the Delphi method,
exponential smoothing, moving averages, regression analysis and generalized linear models.
Since any error in the assumptions will result in a similar or magnified error in the
forecast, the technique of sensitivity analysis is used, which assigns a range of values to
the uncertain factors (variables) [5].
Some assumptions about forecasting are:
• There is no way to state what the future will be with complete certainty.
• Regardless of the methods used, there will always be an element of uncertainty until the
forecast horizon has come to pass.
• Forecasts provided to policy-makers will help them formulate new social policy, which, in
turn, will affect the future.
Forecasting has applications in many situations, such as:
• Weather forecasting, economic forecasting, earthquake prediction, land use forecasting,
product forecasting, player and team performance, etc.
• Tactical planning and/or strategic planning.
Schematically, the main sources of forecast errors can be classified as follows:
• Observations (incomplete data coverage, representativeness errors, measurement errors);
• Models (errors due to, e.g., the parameterization of physical processes, the choice of
closure approximations, and the effect of unresolved scales);
• Data assimilation procedures (errors due to, e.g., the use of a background covariance
model that assumes isotropy and the lack of knowledge of the background errors);
• Imperfect boundary conditions (e.g., errors due to the imperfect estimation and
description of roughness length, soil moisture, snow cover, vegetation properties, and
sea surface temperature).
The forecasting method depends on the assumption that the forecast errors are:
• Normally distributed with mean zero and constant variance, and
• Independent, that is, the autocorrelations at lags ≥ 1 are all zero.
The above assumptions can be checked graphically (normal curve), by obtaining the
correlogram, and by carrying out the Ljung-Box test. The Ljung-Box test is a test of the
null hypothesis $H_0: \rho_1 = \rho_2 = \dots = \rho_h = 0$. The test statistic is
\[ Q = n(n+2) \sum_{j=1}^{h} \frac{r_j^2}{n-j}, \tag{2.1} \]
where $n$ is the sample size, $r_j$ is the sample autocorrelation at lag $j$, and $h$ is the
number of lags being tested. At significance level $\alpha$, the hypothesis of no
correlation at all lags (randomness) is rejected if
\[ Q > \chi^2_{1-\alpha,h}. \]
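As a rough illustration of the Ljung-Box statistic in (2.1), the sketch below computes Q
from the sample autocorrelations using only the Python standard library. The helper names
and the simulated residuals are hypothetical; the critical value 18.307 is the tabulated
upper 5% point of the chi-square distribution with 10 degrees of freedom and is supplied as
a constant rather than computed.

```python
import random


def sample_autocorrelation(x, lag):
    """Sample autocorrelation r_lag of the sequence x."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x)
    ck = sum((x[t] - m) * (x[t - lag] - m) for t in range(lag, n))
    return ck / c0


def ljung_box_q(x, h):
    """Ljung-Box statistic Q = n(n+2) * sum_{j=1..h} r_j^2 / (n - j)."""
    n = len(x)
    return n * (n + 2) * sum(
        sample_autocorrelation(x, j) ** 2 / (n - j) for j in range(1, h + 1)
    )


# Tabulated chi-square critical value for alpha = 0.05 and h = 10 lags.
CHI2_95_DF10 = 18.307

# Hypothetical residuals: simulated white noise, for illustration only.
random.seed(0)
residuals = [random.gauss(0.0, 1.0) for _ in range(200)]
q = ljung_box_q(residuals, h=10)
print(f"Q = {q:.2f}; reject H0 at the 5% level: {q > CHI2_95_DF10}")
```

Applied to model residuals, a value of Q below the critical value gives no evidence against
the independence assumption, whereas a strongly trending or seasonal residual series would
typically push Q far above it.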
2.2 Basic Concepts in Analysis of Rainfall Time Series Data
The special feature of time series analysis is the fact that successive observations are usually
dependent and that the analysis must take into account the time order of the observations.
When successive observations are dependent, future values may be predicted from past
observations. A time series is said to be stationary if there is no systematic change in mean
(no trend), if there is no systematic change in variance and if strictly periodic variations have
been removed. Much of the probability theory of time series is concerned with stationary
time series, and for this reason time series analysis often requires one to transform a non-
stationary series into a stationary one so as to use this theory. Generally, a time series
analysis consists of two steps:
• Building a model that represents a time series, and
• Using the model to predict (forecast) future values.
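The text above does not name a specific transformation for obtaining a stationary series.
One common choice, assumed here purely for illustration, is differencing: subtracting the
previous value removes a linear trend, and subtracting the value one full cycle earlier
removes a strictly periodic pattern. The function name `difference` and the series below are
hypothetical.

```python
def difference(series, lag=1):
    """Return the differenced series y[t] - y[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical 12-month pattern (deviations in mm) repeated over three years,
# added to a slow linear trend of 0.5 mm per month.
seasonal_pattern = [30.0, 25.0, 40.0, 60.0, 35.0, -40.0,
                    -60.0, -45.0, -20.0, 5.0, 20.0, -50.0]
y = [0.5 * t + seasonal_pattern[t % 12] for t in range(36)]

first_differenced = difference(y, lag=1)        # removes the linear trend
seasonally_differenced = difference(y, lag=12)  # removes the 12-month pattern
print(seasonally_differenced[:3])
```

In this artificial example the seasonally differenced series is constant, because the only
thing left after removing the repeating pattern is the trend accumulated over twelve months.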
If a time series has a regular pattern, then a value of the series should be a function of
previous values. If $Y$ is the target value that we are trying to model and predict, and
$Y_t$ is the value of $Y$ at time $t$, then the goal is to create a model of the form:
\[ Y_t = f(Y_{t-1}, Y_{t-2}, \dots, Y_{t-n}) + e_t \tag{2.2} \]

where $Y_{t-1}$ is the value of $Y$ for the previous observation, $Y_{t-2}$ is the value two
observations ago, etc., and $e_t$ represents error that does not follow a predictable
pattern (this is called a random shock). The main objective in investigating a time series
is forecasting future values of the observed series. This can be done through a model that
adequately describes the behavior of the observed variable and produces the required
forecast. Values of variables occurring prior to the current observation are called lag
values. If a time series follows a repeating pattern, then the value of $Y_t$ is usually
highly correlated with $Y_{t-\text{cycle}}$, where cycle is the number of observations in
the regular cycle. For example, monthly observations with an annual cycle can often be
modeled by $Y_t = f(Y_{t-12})$.
The goal of building a time series model is the same as the goal for other types of predictive
models which is to create a model such that the error between the predicted value of the
target variable and the actual value is as small as possible [4].
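The role of lag values in a series with an annual cycle can be illustrated with a small
sketch. It uses a hypothetical, noise-free monthly series and hypothetical helper names
(`autocorr`, `seasonal_naive`); the simplest model of the form $Y_t = f(Y_{t-12})$, in which
the forecast equals the value observed twelve months earlier, is assumed only as an example.

```python
import math


def autocorr(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    m = sum(series) / n
    c0 = sum((v - m) ** 2 for v in series)
    ck = sum((series[t] - m) * (series[t - lag] - m) for t in range(lag, n))
    return ck / c0


def seasonal_naive(series, cycle=12):
    """Forecast the next value as the value observed one full cycle earlier."""
    return series[-cycle]


# Hypothetical monthly series: ten years of a pure annual cycle (mm).
monthly = [100.0 + 50.0 * math.sin(2.0 * math.pi * t / 12.0) for t in range(120)]

r12 = autocorr(monthly, 12)  # same month, one year apart
r6 = autocorr(monthly, 6)    # opposite phase of the annual cycle
print(round(r12, 2), round(r6, 2), round(seasonal_naive(monthly), 1))
```

The strong positive correlation at lag 12 and the strong negative correlation at lag 6 are
what a correlogram of a markedly seasonal series would be expected to show.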
2.3 Review of Rainfall Measurement and Forecasting Methods
Precipitation is mostly measured by ground-based rain gauges, which capture precipitation
and record the total amount as rainfall depth, usually in millimeters. The temporal
measurement resolution can be high, but accuracy can be site-dependent and unreliable in
extreme weather, and measurements can include melted snow or hail in addition to rain. Radar
measurements, by contrast, detect low-altitude atmospheric water content and have excellent
spatiotemporal resolution, but current geographic coverage is restricted and the historical
record is short [5].
The choice of rainfall forecasting model depends on the application. Thunderstorms
implicated in flash floods typically take place on scales of minutes to hours [6], requiring
forecasts on the shortest time scales. Localized flooding often occurs when medium to heavy
rain falls in the same location over several days, inundating rivers and urban drains,
requiring forecasts from hours to days ahead. Predicting droughts requires forecasts on
longer time scales of weeks to months.
Numerical weather prediction (NWP) models are highly complex, nonlinear systems that produce
a single point forecast or a set (ensemble) of point forecasts, allowing the anticipation of
distinct meteorological events. NWP solves the equations of atmospheric dynamics and
produces rainfall predictions. Calibrated against atmospheric measurements, the models vary
in spatial scale from synoptic (on the order of 1000 km) to mesoscale (approximately 50 km);
current operational models have minimum resolutions of approximately 1.3 km (limited-area
mesoscale), forecasting days to a few weeks ahead. Sophisticated NWP systems generate an
ensemble of predictions by varying the model initial conditions and/or the physical
parameterization schemes [?].
Classical statistical forecasting identifies relationships between past observations and
their temporal successors, using observations at the current forecast origin as predictors
for the future state of the atmosphere based solely on these relationships and not on
explicit meteorological information. Methods include conditional climatology (issuing the
successors of the past observational data closest to the current state as a forecast for the
future state) and applications of more sophisticated multiple nonlinear regression such as
neural networks [?]. Included in this category are statistical time series models that can
forecast at the spatiotemporal resolution of rainfall measurements, and are univariate or
multivariate (comprising a vector of rainfall measurements from a number of sites
simultaneously), producing density forecasts. They are either temporally unconditional or
conditional on past time steps.
Linear statistical models such as autocorrelation functions, spectral analysis, analysis of
cross-correlations, linear regression and the Autoregressive Integrated Moving Average
(ARIMA) have been studied for their applicability to flood forecasting. That study found
that stationary Autoregressive Moving Average (ARMA) as well as non-stationary (ARIMA)
versions of linear prediction techniques do not provide accurate predictions. The
application of other linear stochastic methods also resulted in inaccurate predictions,
indicating that linear statistical models do not accurately represent historical data and
hence are not acceptable methods for a non-linear application such as flood forecasting [7].
In autoregressive processes, persistence is present; that is, the future outcome depends on
the magnitude in the present period. Autoregressive Moving Average (ARMA) processes
represent a system of elements moving from one state to another over time. An Autoregressive
Integrated Moving Average (ARIMA) model with an intervention model (also known as
intervention analysis) has been used to evaluate the impact of different local, regional and
global incidents of a man-made, natural and health character in Taiwan over the last decade.
The incidents used in this study were the Asian financial crisis starting in mid-1997, the
September 21st earthquake in 1999, the September 11th terrorist attacks in 2001, and the
outbreak of Severe Acute Respiratory Syndrome (SARS) in 2003. Empirical results revealed
that the SARS illness had a significant impact, whereas the Asian financial crisis, the
September 21st earthquake and the September 11th terrorist attacks showed no significant
effect on air movements [8]. Akuffo and Ampaw (2013) used ARIMA in modeling Ghana's
inflation from 1985 to 2011. Their model passed the relevant diagnostic checks and was used
to forecast inflation for the year 2012. Their model was very accurate, with a forecast
error of less than four (4) percent.
2.4 Generalized linear model for Rainfall variability
In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary
linear regression that allows for response variables that have error distribution models other
than a normal distribution. The GLM generalizes linear regression by allowing the linear
model to be related to the response variable via a link function and by allowing the
magnitude of the variance of each measurement to be a function of its predicted value.
In a generalized linear model (GLM), each outcome of the dependent variable, $Y$, is assumed
to be generated from a particular distribution in the exponential family, a large range of
probability distributions that includes the normal, binomial, Poisson and gamma
distributions, among others. The mean, $\mu$, of the distribution depends on the independent
variables, $X$, through:
\[ E(Y) = \mu = g^{-1}(X\beta), \tag{2.3} \]
where $E(Y)$ is the expected value of $Y$; $X\beta$ is the linear predictor, a linear
combination of unknown parameters $\beta$; and $g$ is the link function. In this framework,
the variance is typically a function, $V$, of the mean:
\[ \operatorname{Var}(Y) = V(\mu) = V\left(g^{-1}(X\beta)\right). \tag{2.4} \]
It is convenient if V follows from the exponential family distribution, but it may simply
be that the variance is a function of the predicted value. The unknown parameters, β,
are typically estimated with maximum likelihood, maximum quasi-likelihood, or Bayesian
techniques. The GLM consists of three elements:
• A probability distribution from the exponential family.
• A linear predictor $\eta = X\beta$.
• A link function $g$ such that $E(Y) = \mu = g^{-1}(\eta)$.
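The three elements listed above can be made concrete with a short sketch. It assumes, purely
for illustration, a gamma response with a log link (the text does not fix a particular
family or link at this point); the coefficient values, the covariate row and the dispersion
are hypothetical. For the gamma family the variance is proportional to the square of the
mean.

```python
import math


def linear_predictor(x_row, beta):
    """eta = x . beta for one row of the design matrix."""
    return sum(x * b for x, b in zip(x_row, beta))


def log_link(mu):
    """Link function g(mu) = ln(mu)."""
    return math.log(mu)


def inverse_log_link(eta):
    """Inverse link g^{-1}(eta) = exp(eta), which keeps the mean positive."""
    return math.exp(eta)


def gamma_variance(mu, dispersion):
    """Gamma-family variance: Var(Y) = dispersion * mu**2."""
    return dispersion * mu ** 2


beta = [4.0, 0.3]   # hypothetical coefficients: intercept and one covariate
x_row = [1.0, 0.5]  # hypothetical covariate values for one observation
eta = linear_predictor(x_row, beta)
mu = inverse_log_link(eta)
print(mu, gamma_variance(mu, dispersion=0.5))
```

A log link is a convenient choice for rainfall amounts in such a sketch because the fitted
mean can never become negative, whatever the value of the linear predictor.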
In rainfall modeling applications, the response (rainfall occurrence or amount) is often
associated with a particular predictor in such a way that the relationship is best thought
of as between the response and some nonlinear transformation of the predictor. Examples
include the investigation of possible long-term cycles in the climate of an area (where the
fundamental predictor for any day’s rainfall is the year in which it occurs, but a cyclical
pattern implies that the relationship is really with a sine wave derived from the year), and
the realistic modeling of orographic variability (typically, the underlying predictors might be
site eastings and northings, but any structure is unlikely to be well represented by putting
these into g(µ) = Xβ directly).
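The idea that the response relates to a nonlinear transformation of a predictor can be
sketched with sine and cosine terms. The example below applies it to the month of the year
rather than to a long-term cycle over years, which is an assumption made for illustration;
the log link, the coefficients and the function names are likewise hypothetical.

```python
import math


def harmonic_row(month, period=12):
    """Design-matrix row: intercept plus sine and cosine of the seasonal phase."""
    phase = 2.0 * math.pi * month / period
    return [1.0, math.sin(phase), math.cos(phase)]


def expected_rainfall(month, beta):
    """Mean rainfall under a log link: mu = exp(x . beta)."""
    eta = sum(x * b for x, b in zip(harmonic_row(month), beta))
    return math.exp(eta)


beta = [4.0, 0.6, -0.2]  # hypothetical coefficients, not fitted to any data
means = [expected_rainfall(m, beta) for m in range(1, 13)]
print([round(v, 1) for v in means])
```

The raw predictor (the month number) enters the linear predictor only through its sine and
cosine, so the fitted mean repeats exactly every twelve months instead of growing or
shrinking with the month number.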
2.5 Gamma Distribution
The gamma distribution is frequently used to represent precipitation because it provides a
flexible representation of a variety of distribution shapes while utilizing only two
parameters, the shape and the scale (Wilks, 1990). It is a good choice for describing
precipitation values for a variety of reasons. The first advantage of the gamma distribution
is that it is bounded on the left at zero (Thom, 1958; Wilks, 1995). This is important for
precipitation applications because negative rainfall is impossible, so a distribution that
excludes negative values is readily applicable. This is especially important in dry areas or
locations with high variability but a low mean. Second, the gamma distribution is positively
skewed, meaning that it has an extended tail to the right of the distribution. This is
advantageous because it mimics actual rainfall distributions for many areas where there is a
non-zero probability of extremely high rainfall amounts, even though the typical rainfall
may not be very large (Ananthakrishnan and Soman, 1989).
The gamma distribution is a parametrized continuous distribution that allows a
comprehensive analysis of the rainfall based on the acquired sample. The parameters of
the gamma probability density function are estimated for each set of monthly rainfall data.
For this study, the gamma distribution parameters are estimated through maximum
likelihood estimation (MLE). Following Thom (1966), the calculation of the MLEs begins
with the calculation of an intermediate value A. The gamma probability density function
with parameters α and β is given by
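\[ f(x, \alpha, \beta) = \frac{(x/\beta)^{\alpha-1} e^{-x/\beta}}{\beta\,\Gamma(\alpha)}, \qquad (2.5) \]
and the gamma function can be written as:
\[ \Gamma(\alpha) = \int_{0}^{\infty} e^{-t}\, t^{\alpha-1}\, dt \qquad (2.6) \]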
where α and β are the shape parameter and the scale parameter respectively in (2.5).
Thom found that the gamma distribution fits climatological precipitation well.
Thom (1958) began the maximum likelihood estimation with the calculation of A,
where the xi are all positive values in the rainfall history, and the mean (¯x) is the
arithmetic mean of all positive values. The value A is equal to the natural log of the mean
minus the mean of the natural logs of the positive accumulations at a point (2.7). This value is
then used in the estimation of the shape parameter, represented by ˆα in (2.8). The scale
estimator ˆβ is the mean divided by the estimated shape parameter (2.9). Therefore
the product of the shape α and the square of the scale β is approximately equal to the
variance (S2), given as the mean of the squared differences from the mean (2.10).
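\[ A = \ln(\bar{x}) - \frac{\sum_{i=1}^{n} \ln(x_i)}{n} \qquad (2.7) \]
where n is the number of rainfall observations.
\[ \hat{\alpha} = \frac{1}{4A}\left(1 + \sqrt{1 + \frac{4A}{3}}\right) \qquad (2.8) \]
\[ \hat{\beta} = \frac{\bar{x}}{\hat{\alpha}} \qquad (2.9) \]
\[ S^2 = \hat{\alpha}\hat{\beta}^2 \qquad (2.10) \]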
Equation (2.9) can be rewritten to show that the product of the parameter estimates is
equal to the mean of the positive values in the rainfall history. The rewriting of (2.9) and
(2.10) can aid in providing some intuitive understanding of the rainfall distribution at any
point (Husak et al., 2006).
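The estimation steps described in this section can be illustrated with a minimal Python sketch. It is not the procedure used to produce the results of this study; the function name `thom_gamma_mle` and the sample values are hypothetical, and zero or negative amounts are simply discarded, on the assumption that only positive accumulations enter the estimate.

```python
import math


def thom_gamma_mle(rainfall):
    """Approximate gamma MLEs (shape, scale) using Thom's intermediate value A.

    Only strictly positive amounts are used; this mirrors the description
    in the text and is an assumption of this sketch.
    """
    positive = [x for x in rainfall if x > 0]
    n = len(positive)
    if n < 2:
        raise ValueError("need at least two positive rainfall values")

    mean = sum(positive) / n
    mean_log = sum(math.log(x) for x in positive) / n

    # A = ln(mean) - mean of ln(x_i); it is zero when all values are equal.
    a = math.log(mean) - mean_log
    if a <= 0:
        raise ValueError("A must be positive (values are all identical)")

    shape = (1.0 + math.sqrt(1.0 + 4.0 * a / 3.0)) / (4.0 * a)
    scale = mean / shape  # so that shape * scale equals the sample mean
    return shape, scale


# Hypothetical monthly totals in mm, for illustration only.
sample = [12.5, 40.2, 0.0, 88.0, 5.1, 130.4, 61.7]
shape_hat, scale_hat = thom_gamma_mle(sample)
variance_hat = shape_hat * scale_hat ** 2
```

By construction the product of the two estimates reproduces the mean of the positive values, which is the property noted in the paragraph above.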

Chapter 3
METHODOLOGY
3.1 Data Source and Sample size
Monthly rainfall datasets for the KIGALI-AERO station in Kigali City have been used in
this study. The data were made available by the RMA (Rwanda Meteorology Agency)
databank; they consist of monthly station rainfall data from 1971 to 2014 for the KIGALI-AERO
station, of differing lengths and completeness of record. However, a portion (0.95%) of the
whole dataset is missing: the months from May to September 1994, owing to the Genocide
against the Tutsi in 1994, when most meteorological stations in Rwanda were unattended for many days.
3.2 Time Series Method for Data Analysis
Time series models of the time-spaced events will be formulated to understand the mechanisms
that generate the events, to forecast future events, and to support optimal control
of events. A generalized linear model (GLM) will also be used to test for suspected
changes in rainfall patterns and to quantify their structure; the GLM approach provides a
powerful tool for interpreting historical rainfall records. The data will be subdivided into
an annual cycle of four seasons: SON (September, October and November), DJF
(December, January and February), MAM (March, April and May) and JJA (June, July
and August).
3.2.1 Box-Jenkins Algorithm
This approach will use past rainfall data to provide forecasts. Using the ARMA
self-projecting time series forecasting model, I hope to find a mathematical formula that
approximately generates the historical patterns in the time series. A self-projecting
model uses only the time series data of the activity being forecast. The
Box-Jenkins methodology seeks to transform any time series to be stationary, and
then uses the stationary univariate process to forecast future values,
with a host of selection and diagnostic tools. The process involves
three basic steps, as discussed below.
Model Identification
This stage involves the specification of the correct order of the ARIMA model by determining
the appropriate order of the AR and MA parts and of the integrated part, i.e. the differencing order. The
major tools in the identification process are the (sample) autocorrelation function (ACF)
and partial autocorrelation function (PACF) [8]. The identification approach is
designed for both stationary and non-stationary processes.
In the correlogram, the spikes are lines drawn at each lag with
length equal to the magnitude of the autocorrelation at that lag, and the pattern of these spikes
distinguishes a stationary from a non-stationary process, as summarized in the Table below.
For a stationary series, a moving average model of order q, MA (q), can be
identified by an autocorrelation function (ACF) that is zero at lags greater than q
and a partial autocorrelation function (PACF) that tails off in exponential fashion. For an AR (p)
process, however, the PACF is zero at lags greater than p and the
ACF tails off in exponential fashion. In an ARMA (p, q) process, the ACF and
PACF have large values up to lags q and p respectively and then tail off in exponential
fashion.
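As an illustration of the two identification tools, the following Python sketch computes the sample ACF directly and the sample PACF through the Durbin-Levinson recursion. The study itself relies on SPSS for these plots; the function names and the choice of the Durbin-Levinson recursion are assumptions of this sketch.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag (r_0 is always 1)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]


def sample_pacf(series, max_lag):
    """Sample partial autocorrelations via the Durbin-Levinson recursion."""
    rho = sample_acf(series, max_lag)
    pacf = [1.0]
    phi_prev = []  # coefficients phi_{k-1,1}, ..., phi_{k-1,k-1}
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append phi_kk.
        phi_prev = [
            phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)
        ] + [phi_kk]
        pacf.append(phi_kk)
    return pacf
```

A spike in `sample_pacf` that vanishes after lag p points towards an AR (p) model, and a spike in `sample_acf` that vanishes after lag q points towards an MA (q) model, in line with the rules stated above.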
The complete framework for the identification is as shown in the Table below
Model Estimation
Depending on the ACF and PACF of the sequence plots, a model is fitted with SPSS software.
The best-fitting model should have as few parameters as possible while achieving the best
fit statistics according to the information selection criteria.
Model Checking
Model checking in time series can be done by looking at the residuals, traditionally
given by: residual = observed value - fitted value. If the model fits the data well, these
residuals should be normally distributed with zero mean, uncorrelated, and of minimum variance or
dispersion. Model validation therefore usually consists of
plotting the residuals over time. A comprehensive procedure includes:
• Plot the residuals against time and inspect for increasing (decreasing) variation, which
may suggest the need for a data transformation.
• Plot the ACF and PACF of the residuals.
• Plot the residuals against the fitted values and check for variation and correlation.
• Check the t ratios of the parameter estimates to see whether any term(s) need to be dropped from
the model.
• Check the correlograms derived from the residuals to determine whether additional
terms are required.
Residual analysis can also be done through formal tests such as the Portmanteau test and
other statistical tests.
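The Portmanteau test mentioned above can be sketched in Python through the Ljung-Box form of the statistic, Q = n(n + 2) Σ r_k² / (n − k), summed over lags k = 1, ..., h. The function name `ljung_box_q` is hypothetical, and the sketch returns only the statistic; under the usual assumptions it would be compared with a chi-squared distribution with h − p − q degrees of freedom for a fitted ARMA (p, q) model.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box portmanteau statistic for the first `max_lag` residual lags."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [e - mean for e in residuals]
    c0 = sum(d * d for d in dev)

    total = 0.0
    for k in range(1, max_lag + 1):
        # Sample autocorrelation of the residuals at lag k.
        r_k = sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        total += r_k * r_k / (n - k)
    return n * (n + 2) * total
```

A large value of Q relative to the chi-squared critical value suggests that the residuals are still autocorrelated and that the model should be revised.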

Figure 3.1: Diagram of Box-Jenkins Modeling Approach
3.3 Time Series Model Type
3.3.1 Autoregressive Moving Average ARMA (p, q) model
The general form of this model is the ARMA (p, q) model, which is a combination of an
AR (p) and a MA (q) model; the most frequently used special case is the ARMA (1, 1) model.
The ARMA (p, q) model is defined as
\[ X_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i X_{t-i} + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}, \qquad (3.1) \]
where ϕ1, . . . , ϕp and θ1, . . . , θq are parameters of the model, c is a constant, and the
random variable εt is a white noise error term. A time series that is made stationary
by differencing before an ARMA model is fitted is described by an Autoregressive Integrated
Moving Average (ARIMA) model. An ARIMA model is represented by three parameters: p, the order
of the autoregressive part; d, the order of differencing; and q, the order of the moving average
part. The ARIMA model takes historical data and decomposes them into an autoregressive (AR)
process, which maintains memory of past events; an integrated (I) process, which makes the data
stationary for ease of forecasting; and a moving average (MA) process of forecast errors. A
well-specified model does not suffer from serial correlation between the residuals and their own
lagged values. Whether an ARIMA (p,d,q) model is a good statistical fit for the data can be
checked using the Akaike Information Criterion (AIC) and the Schwarz Criterion (SC). Autocorrelation
(AC) and partial autocorrelation (PAC) statistics help to determine the right parameters
for the ARIMA model.
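To make the recursion in (3.1) concrete, the following minimal Python sketch generates an ARMA (p, q) series from a given noise sequence. The function name `arma_generate` is hypothetical, and pre-sample values of the series and of the noise are taken to be zero, which is an assumption of the sketch rather than part of the model definition.

```python
def arma_generate(c, phi, theta, noise):
    """Generate X_t = c + e_t + sum(phi_i * X_{t-i}) + sum(theta_i * e_{t-i}).

    `phi` and `theta` are lists of AR and MA coefficients; terms that would
    refer to times before the start of the sample are treated as zero.
    """
    x = []
    for t, e_t in enumerate(noise):
        ar_part = sum(
            phi[i] * x[t - 1 - i] for i in range(len(phi)) if t - 1 - i >= 0
        )
        ma_part = sum(
            theta[i] * noise[t - 1 - i]
            for i in range(len(theta))
            if t - 1 - i >= 0
        )
        x.append(c + e_t + ar_part + ma_part)
    return x


# ARMA (1, 1) with illustrative coefficients and a single unit shock.
path = arma_generate(1.0, [0.5], [0.4], [1.0, 0.0, 0.0])
```

Passing a list of independent normal draws as `noise` would give a simulated ARMA series; the single-shock input above simply makes each term of the recursion easy to follow by hand.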
3.4 Generalized linear model for Rainfall
Generalized linear models (GLMs) extend the classical linear regression model, and are
well established in the statistical literature. The fundamental idea is to predict a probability
distribution for each month's rainfall at Kigali City, by relating the mean of that
distribution to the values of various other related quantities ("covariates"). Possible covariates
include previous months' rainfall. Formally, a GLM for an n × 1 vector of random variables
Y = (Y1, . . . , Yn), each dependent on p predictors (whose values can be assembled into an
n × p matrix X whose (i, j)th element is the value of the jth predictor for Yi), consists of
specifying a probability distribution for Y, with mean vector µ = (µ1, . . . , µn), such that
\[ g(\mu) = X\beta \qquad (3.2) \]
Here, g is a monotonic function (the link function) and β is a p × 1 vector of coefficients
(by g(µ) we mean the n × 1 vector whose ith element is given by g(µi)). Model (3.2) is a
natural extension of the simple linear regression model. A constant term in the model can
be defined by including a column of 1s in the matrix X. When, as here, the Ys arise as
one or more time series and we wish to include previous values of the series as predictors,
we are implicitly studying the conditional distributions of each Y given the past, and the
usual GLM methodology carries over straightforwardly [9]. The implementation will broadly
follow that of Coe and Stern [10], who adopted a two-stage approach as follows:
• For stage 1 (occurrence model), model the pattern of wet and dry days at a site using
logistic regression. Let pi denote the probability of rain for the ith case in the data
set, conditional on a covariate vector Xi; then the model is given by
\[ \ln\left(\frac{p_i}{1 - p_i}\right) = X_i\beta \qquad (3.3) \]
for some coefficient vector β.

• For stage 2 (amounts model), fit gamma distributions to the amount of rain on wet
days. The rainfall amount for the ith wet day in the database is taken, conditional on
a covariate vector ξi, to have a gamma distribution with mean µi, where
\[ \ln(\mu_i) = \xi_i\gamma \qquad (3.4) \]
for some coefficient vector γ. All gamma distributions are assumed to have a common shape
parameter ν, say (if ν = 1 the distributions are exponential). This is equivalent to assuming
that, conditional on the covariates, daily rainfall values have a constant coefficient of
variation [11]. These two models are referred to as the occurrence and amounts models
respectively. The right-hand sides of (3.3) and (3.4) are called linear predictors. In the GLM
(Generalized Linear Model) framework, model fitting (estimation of the coefficient vectors β
and γ) and selection can be carried out using likelihood methods. Models can be checked
using a variety of simple but informative residual plots. Further features include the ability
to model interactions between covariates (two covariates are said to interact if the effect of
one of them depends upon the value of the other), and the estimation of non-linear
transformations of covariates. Interactions can yield useful information about the mechanisms
driving the rainfall process. Models (3.3) and (3.4) specify probability distributions for
monthly rainfall conditioned on the values of various covariates such as previous rainfalls,
time of year and external factors.
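The way the two link functions turn a linear predictor into a rain probability and a mean wet amount can be sketched as follows. The sketch covers only this prediction step, not the likelihood fitting of the coefficient vectors; the function names, covariate values and coefficients are all hypothetical.

```python
import math


def linear_predictor(covariates, coefficients):
    """Inner product of one covariate vector with a coefficient vector."""
    return sum(x * b for x, b in zip(covariates, coefficients))


def rain_probability(covariates, beta):
    """Occurrence model: invert the logit link to obtain p_i."""
    eta = linear_predictor(covariates, beta)
    return 1.0 / (1.0 + math.exp(-eta))


def mean_wet_amount(covariates, gamma):
    """Amounts model: invert the log link to obtain mu_i."""
    return math.exp(linear_predictor(covariates, gamma))


# Hypothetical covariates: a constant term and the previous period's rainfall.
x_i = [1.0, 0.8]
p_i = rain_probability(x_i, [-0.5, 1.2])
mu_i = mean_wet_amount(x_i, [2.0, 0.3])
expected_rainfall = p_i * mu_i  # probability of rain times mean wet amount
```

The leading 1.0 in the covariate vector plays the role of the column of 1s in X that defines the constant term.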

Chapter 4
RESULTS AND ANALYSIS
4.1 Data presentation
The presentation of data is an important part of the process of data analysis and
report writing. Although results can be expressed within the text of a report, data are
usually more digestible if they are presented in the form of a table or graphical display. The
graph of average yearly rainfall below directs the reader to the essential
points or trends in the data. The full monthly record contains too many observations for the
variations between months in different years to be visualized clearly; however, there are of
course some trends and seasonality, as the climate of Rwanda is divided into three
agro-climatic zones: a high-altitude region, a central plateau, and the plateau of the eastern lowlands.
4.1.1 Preliminary Analysis
The entire data span, 1971-2014, was too long to study effectively at monthly resolution and to
draw a sound conclusion, and it contained the missing values described in
Section 3.1. Therefore we chose to use the yearly average data to build an appropriate
time series model.
From the summary table, there was a minimum rainfall of 0.00 mm and a maximum rainfall of 324.30
mm recorded in Kigali City. A mean rainfall of 3649.006 mm and a standard deviation of
60.40701 mm were recorded in Kigali City for the same period. These
show an uneven distribution of rainfall amongst the months for the period January
1971 to December 2014. The standard deviation of 60.40701 mm shows that there
was great dispersion of rainfall amongst the months and years under study. That is,
rainfall was relatively high in some months and in some years, and relatively low in
others, showing a wide dispersion that may lead to non-stationarity in the level of the rainfall
figures in the city.
The summary statistics and time plots of the monthly rainfall data were examined to check
the stability of the data. The time plots and the summary statistics indicate
that the data are not stationary in level and exhibit some seasonal behaviour; this must
be formally tested. Before the data analysis, the data were edited and cleaned to avoid errors and
omissions; as a portion of the data was missing, appropriate
techniques were used to handle the missing values.
4.2 Fitting the appropriate time series model of yearly average data
The yearly average data have been studied to identify and build an adequate time series
model.
4.2.1 Identifying potential model
The identification of a potential model is based on the patterns of the autocorrelation and partial
autocorrelation functions (ACF, PACF). These are plots of the autocorrelations and partial
autocorrelations at various lags, against the size of the lag. In the autocorrelation
plot, the number of lags that can be shown is roughly the number of observations minus
2. The correct order of the ARIMA model is specified by determining the
appropriate order of the AR (autoregressive) and MA (moving average) parts and of the integrated
part, i.e. the differencing order. The identification approach is designed for both
stationary and non-stationary processes.
Figure 4.1: plots of autocorrelation of yearly average
The above plots of the autocorrelation function (ACF) and partial autocorrelation function (PACF)
show a significant spike at lag one followed by a cut-off, so the orders of the AR (autoregressive)
and MA (moving average) parts can be specified as AR(1) and MA(1). The ACF and PACF plots are
also among the tools used to judge whether a given series is stationary. From the plots, the
yearly average series is not stationary; by stationarity we mean a series with no systematic
change in mean (no trend), no systematic change in variance, and with strictly periodic
variations removed. This can also be seen from the plot of the data against time.
Figure 4.2: plots of yearly average data with time
From the above figure, the series shows variation in both mean and variance, hence
it can be concluded to be non-stationary in both, and an appropriate transformation is
needed to make it stationary. The series was differenced twice (d = 2) to obtain a stationary
series with no systematic change in mean and variance. Because the plot is of yearly averages,
there were no periodic variations (i.e. there is no need for seasonal decomposition). Damped
oscillations were noticed in the ACF and PACF correlograms above, which is evidence of both
AR and MA parameters in the optimal model. Both the sample ACF and PACF show
a large positive value at lag 1 and much smaller values at higher lags. This suggests a
combination of two models, i.e. an AR (1) and an MA (1) model. Thus, since the order of
differencing is 2, the identified potential model for the original data is ARIMA (1,2,1).
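The differencing step can be illustrated with a short Python sketch. The function name `difference` and the example values are hypothetical; the example uses a quadratic trend because differencing twice reduces it to a constant, which shows why d = 2 can remove a trend that a single difference leaves behind.

```python
def difference(series, d=1):
    """Apply the first-difference operator d times: y_t - y_{t-1}."""
    out = list(series)
    for _ in range(d):
        out = [out[t] - out[t - 1] for t in range(1, len(out))]
    return out


# A quadratic trend (hypothetical values): one difference leaves a linear
# trend, a second difference leaves a constant series.
trend = [1.0, 4.0, 9.0, 16.0, 25.0]
once = difference(trend, d=1)
twice = difference(trend, d=2)
```

Each round of differencing shortens the series by one observation, so a yearly series loses two values when d = 2.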
4.2.2 Model fitting
Once a potential model has been identified, the second step is to fit the model, but here
the solution is not as simple as in regression. It requires an iterative process, and the
model should have as few parameters as possible alongside the best fit statistics.
In model fitting the principle of parsimony is generally applied, that is, the rule of seeking
the simplest adequate model. For example, in time series, if neither AR (p) nor
MA (q) models are plausible, it is natural to try ARMA (p, q) and, in accordance with
the principle of parsimony, to use p and q as small as possible, starting therefore with
p = q = 1. The statistical package SPSS was used to find the best estimates.
Table 4.1: ARIMA Model parameters
Parameter   Estimate   Standard Error   t-statistic   Significance level
Constant    -0.30      0.140            -0.215        0.831
AR          -0.212     0.175            -1.210        0.233
MA           0.991     1.220             0.813        0.421
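The ARIMA (1,2,1) model, with p = q = 1 and d = 2 (one autoregressive component, two orders of
differencing and one moving average component), is given as:
\[ \tilde{Y}_t = \mu + \alpha \tilde{Y}_{t-1} - \theta \varepsilon_{t-1} + \varepsilon_t, \qquad (4.1) \]
where \tilde{Y}_t denotes the twice-differenced series. Substituting the parameter estimates
from Table 4.1 into (4.1), the model is derived as:
\[ \tilde{Y}_t = -0.30 - 0.212\,\tilde{Y}_{t-1} - 0.991\,\varepsilon_{t-1} + \varepsilon_t, \qquad (4.2) \]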
4.2.3 Model checking
One advantage of the Box-Jenkins method is that it is possible to formally check whether
a model is adequate. If a model is adequate:
• Firstly, the residuals should not show any patterns.
• Secondly, the residuals should not be serially correlated.
Thus we can check for patterns by plotting the residuals versus time. We can also check
for autocorrelation by examining the ACF and PACF of the residuals. Some problems can
arise in interpreting these plots, partly because so many lags are tested simultaneously and
partly because the estimates of the correlations are themselves correlated. For this reason,
a portmanteau or overall test has been developed, known as the Box-Pierce
statistic or, in a refined version, the Box-Ljung statistic; it is referred to a chi-squared
distribution. These statistics test whether the autocorrelations at the first k lags are in
accord with the null hypothesis that they are all zero, i.e. are consistent with the residuals
forming a random process. Rejecting this hypothesis indicates that the model is not adequate [4].
The figures below show the correlograms of the residual ACF and PACF.
Figure 4.3: The correlogram of ACF and PACF
From the correlogram, we can see that almost all the spikes lie within the confidence
limits; the only exception is the correlation at lag 2, which we ignore since it is weak
(less than 0.5).
4.3 RESULTS AND INTERPRETATIONS

Figure 4.4: plots of yearly average data with time

Chapter 5
DISCUSSION, CONCLUSION
AND RECOMMENDATION
5.1 DISCUSSION
This part presents the discussion of the main results of this project and comparisons of the
results done by other researchers discussed in the literature review part.
5.2 CONCLUSION
5.3 RECOMMENDATION

Bibliography
[1] Munyaneza O, Uhlenbrook S, Maskey S, Wali UG, Wenninger J. Hydrological and climatic
data availability and preliminary analysis in Rwanda. In Proceedings of the Hydrology,
10th International WATERNET/WARFSA/GWP-SA Symposium, Entebbe,
Uganda 2009 Oct (pp. 28-30).
[2] Wu J. An Effective Hybrid Semi-Parametric Regression Strategy for Rainfall Forecasting
Combining Linear and Nonlinear Regression. IGI Publishing, Hershey, PA, USA;
2011 Oct; Vol. 2, No. 4. ISSN: 1942-3594, EISSN: 1942-3608.
doi:10.4018/jaec.2011100104.
[3] Htike KK, Khalifa OO. Rainfall forecasting models using focused time-delay neural
networks. In Computer and Communication Engineering (ICCCE), 2010 International
Conference on 2010 May 11 (pp. 1-6). IEEE.
[4] Gasana, Emelyne. Time series and forecasting lecture notes. College of Science and
Technology : s.n., 2015-2016.
[5] Upton MH, Miller T, Chiang TC. Upton et al. Reply. Physical Review Letters. 2005
Feb 23;94(7):079702.
[6] Battan LJ. Fundamentals of meteorology. 2nd ed. Englewood Cliffs: Prentice-Hall;
1984.
[7] Moura AD, Hastenrath S. Climate prediction for Brazil's Nordeste: Performance of
empirical and numerical modeling methods. Journal of Climate. 2004 Jul;17(13):2667-72.
[8] Ampaw EM, Akuﬀo B, Larbi SO, Lartey S. Time Series Modelling of Rainfall in New
Juaben Municipality of the Eastern Region of Ghana. International Journal of Business
and Social Science. 2013 Jul 1;4(8).
[9] Fahrmeir L, Tutz G. Multivariate statistical modelling based on generalized linear
models. Springer Science & Business Media; 2013 Mar 14.
[10] Stern RD, Coe R. A model ﬁtting analysis of daily rainfall data. Journal of the Royal
Statistical Society. Series A (General). 1984 Jan 1:1-34.
[11] McCullagh P, Nelder JA. Generalized linear models. CRC press; 1989 Aug 1.
[12] Thom H. A note on the gamma distribution. Monthly Weather Review. 1958;86(4).
