
Article Title Page

Benchmarking of Marine Bunker Fuel Suppliers: The Good, The Bad, The Ugly

Author Details
Author 1 Name: Ole Jørgen Anfindsen
University/Institution: DNV Research & Innovation
Town/City: Høvik
Country: Norway

Author 2 Name: Grunde Løvoll
Town/City: Høvik
Country: Norway

Author 3 Name: Thomas Mestl
Town/City: Høvik
Country: Norway

Corresponding author: Ole Jørgen Anfindsen
Corresponding Author’s Email: ole.jorgen.anfindsen@dnv.com

Acknowledgments (if applicable): n/a

Biographical Details (if applicable): Ole Anfindsen holds a dr. scient. degree (PhD) in computer science and a bachelor’s degree
in electronics engineering. For more than 25 years he has worked with databases and related technologies. He has been senior
research scientist in Telenor R&D, visiting researcher at GTE Laboratories (Massachusetts) and Sun Microsystems Laboratories
(California), as well as adjunct associate professor at the Institute of Informatics at the University of Oslo. He currently works as a
researcher in the Research & Innovation department of DNV, where his main activity is directed towards data analysis especially in
the maritime area.
Grunde Løvoll has a dr. scient. degree (PhD) in physics. He has worked for 6 years as a postdoc and researcher at the
Department of Physics at the University of Oslo doing experimental studies on multiphase flow in porous materials, water diffusion in
dry clay and optical tweezers. Dr. Løvoll currently works as a researcher in DNV Research & Innovation, where his main focus is on
data analysis in the maritime area.
Thomas Mestl has a dr. scient. degree (PhD) in mathematics and a degree in precision engineering. He has worked in DNV's Research
Department for the last 13 years within the field of information technology. A large part of his work has been on identifying emerging
technology trends, evaluating new ICT technologies (especially with respect to mobile work and information management), and
identifying promising business opportunities offered by new technologies or by combinations of existing ones. Currently, his main
activity is directed towards data analysis, especially in the maritime area.

Structured Abstract: Purpose - This paper has two main focus areas: the construction of a realistic best practice benchmark, and
the development of a methodology for comparison of individual suppliers of marine bunker fuel. As is well-known in this trade, unfair
business behaviors in the bunker fuel market are not uncommon, resulting in financial losses for the buyers.
Design/methodology/approach - Establishing a best practice will naturally involve some degree of subjectivity as there is no a
priori correct answer to this problem. Using the concept of membership functions from fuzzy set theory, a score can be derived from
a best practice benchmark histogram. The main advantages of this method are its relative independence both of sample size and of
the underlying distribution, as well as being computationally very efficient.

Findings - Our methodology turns out to be more powerful than standard descriptive statistics, as it is less sensitive to outliers and
is well suited for small datasets and even single numbers. When applied to data for all suppliers worldwide it turns out that the
number of good suppliers is actually much lower than might be expected.

Practical implications - Bunker fuel is a major expense for ship owners, and can easily reach $30 million/year
for a single container ship. There is therefore a considerable interest in the market for benchmarking of individual
fuel suppliers. Our methodology is also applicable to other quality related fuel parameters.

Originality/value - To the best of our knowledge this is the first attempt to benchmark actors in the marine bunker
fuel industry and to quantify their behaviors.
Keywords: benchmarking, membership functions, scoring, fuzzy clustering, supplier quality, best practice

Article Classification: Technical paper

Benchmarking of Marine Bunker Fuel Suppliers:
The Good, The Bad, The Ugly

Abstract
Purpose
This paper has two main focus areas: the construction of a realistic best practice benchmark, and the
development of a methodology for comparison of individual suppliers of marine bunker fuel. As is well-known
in this trade, unfair business behaviors in the bunker fuel market are not uncommon, resulting in financial losses
for the buyers.

Design/methodology/approach
Establishing a best practice will naturally involve some degree of subjectivity as there is no a priori correct
answer to this problem. Using the concept of membership functions from fuzzy set theory, a score can be derived
from a best practice benchmark histogram. The main advantages of this method are its relative independence
both of sample size and of the underlying distribution, as well as being computationally very efficient.

Findings
Our methodology turns out to be more powerful than standard descriptive statistics, as it is less sensitive to
outliers and is well suited for small datasets and even single numbers. When applied to data for all suppliers
worldwide it turns out that the number of good suppliers is actually much lower than what might be expected.

Practical implications
Bunker fuel is a major expense for ship owners, and can easily reach $30 million/year for a single container ship.
There is therefore a considerable interest in the market for benchmarking of individual fuel suppliers. Our
methodology is also applicable to other quality related fuel parameters.

Originality/value
To the best of our knowledge this is the first attempt to benchmark actors in the marine bunker fuel industry and
to quantify their behaviors.

Keywords: benchmarking, membership functions, scoring, fuzzy clustering, supplier quality,
best practice
Category: Technical Paper

1. Introduction
The density of marine bunker fuel can be regarded as one of its most basic parameters. It is used for
fuel quantity estimation and for calculating the specific energy content of the fuel, and it is also the
basis for the so-called Calculated Carbon Aromaticity Index (CCAI), an important indicator of ignition
quality and of deposit formation in the engine. Density is also an important factor when it comes to the
process of separating water or solids from bunker fuel.
For the typical ship operator the primary importance of density comes from the fact that bunker fuel is
delivered by volume but paid for per ton. The conversion is done by means of the fuel density reported by
the supplier. A small difference between stated and actual fuel density can quickly lead to large
financial losses for the ship operator. For instance, if a density of 977 kg/m3 is stated when the
actual value happens to be 960 kg/m3, this will give rise to a difference of nearly 35 tons when
bunkering 2000 m3, the value of which, in the current market, is close to US$ 20,000, just for a single
bunkering.
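The arithmetic behind this example can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `mass_tons` is hypothetical, and the figures are the ones quoted in the example above.

```python
def mass_tons(volume_m3: float, density_kg_per_m3: float) -> float:
    """Convert a delivered volume to mass in metric tons."""
    return volume_m3 * density_kg_per_m3 / 1000.0


volume = 2000.0    # m3 bunkered
claimed = 977.0    # kg/m3, density stated by the supplier
measured = 960.0   # kg/m3, density found by the testing agency

invoiced_tons = mass_tons(volume, claimed)     # what the buyer pays for
delivered_tons = mass_tons(volume, measured)   # what the buyer receives
short_lift_tons = invoiced_tons - delivered_tons

print(round(short_lift_tons, 1))  # 34.0, i.e. "nearly 35 tons"
```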
Although this example belongs in the high end of the spectrum, it is not at all hard to find even more
extreme examples in real life. Many fuel suppliers exploit this opportunity, since their stated density
is usually used to calculate the quantity of the delivered fuel. Over-reporting of density, i.e. claiming
that the fuel density is higher than what is actually the case, is called short-lifting, while the
opposite could be termed long-lifting. Short-lifting implies that the ship operator loses money, since
he pays for more fuel than he receives. Long-lifting implies that the fuel supplier loses money, and that
the ship operator gets more than what he pays for.
The global market for marine bunker fuel is more than 300 million tons annually (IEA 2010, p. 618;
Eyring et al 2010; IMO 2009; EPA 2008). We estimate that more than 300,000 tons of bunker fuel, i.e.
about 1‰ of the global consumption, is short-lifted every year. We further estimate that the amount of
long-lifting exceeds 150,000 tons. That is, on the order of half a million tons are long- or short-lifted
annually. Thus, bunker fuel worth more than US$200 million appears not to be properly accounted for
every year.
Both short- and long-lifting may be indications of fraudulent behavior by individual employees within
the ship operator’s or bunker fuel supplier’s organization. Such behavior is, however, sufficiently
widespread that a systematic and commonly accepted short-lifting practice in parts of the bunker fuel
trade may be suspected. Some fuel suppliers use this tactic to consistently over-state the delivered
amount to improve the company’s profit margin. Many ship operators and suppliers would welcome a
benchmarking of suppliers, ports, or geo-regions against some best practice.
The rest of the paper is organized as follows: In Section 2 we take a closer look at concrete examples
of different density reporting strategies and discuss the difficulties associated with single number
characteristics. In Section 3 we use this to characterize good suppliers and derive criteria for defining a
best practice. In Section 4, a Best Practice Classifier is constructed that will assign a Best Practice
Score to an individual bunkering or a supplier. We also present a series of benchmarking comparisons
between regions together with an overview of how they developed over a 10 year period. This paper
ends with a discussion and some promising leads for further work.

2. Investigating density reporting behavior
Table 1 gives some statistics for density deviations on a global and regional basis (e.g. Canada & US
West Coast, South Asia, Middle East, and South America West) and for 4 selected suppliers (S1, S2, S3,
S4) in 4 different bunker ports. The density difference, dd, is the difference between the density
claimed by the supplier and the actual density measured by a fuel testing agency (e.g. DNVPS), i.e.
dd = claimed - measured, so that positive values correspond to short-lifting. The average density
difference, $\overline{dd}$, could in principle be used to characterize the behavior of a fuel
supplier (a port or a region) as good, medium or bad.
Unfortunately, most such single number quality measures have shortcomings, as they compress a wealth of
information into a single number and thereby wipe out much of the information about the interesting
behavior of a supplier. In addition, the arithmetic mean and the median may be less suited for
distributions that are non-normal, skewed or heavy-tailed. Also, the mean and standard deviation are
very sensitive to outliers (a few unusually large or small observations) (Bhattacharyya & Johnson 1977).
As an example, the mean value of ten bad bunkerings could easily be balanced by one exceptionally good
one (or a typing error), while the median is less sensitive to outliers. Another problem with the mean
and median is that they reveal nothing about the shape of the underlying distribution. For instance, if
we only look at the mean, the geo-region South America West seems to be better than e.g. Canada & US
West Coast from a short-lifting perspective, see Table 1. If we take the standard deviation into account
it is obvious that there is a higher risk of being short-lifted in South America West than in the other
geo-regions, simply because the distribution is wider. The standard deviation, however, only describes
the width of the underlying distribution, not its actual shape. As can be seen in Figure 2 the
distributions are non-normal, typically showing a highly skewed middle spike combined with a very long
one-sided tail.
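The outlier sensitivity of the mean can be illustrated with a small sketch. The dd values below are hypothetical and chosen only to mirror the "ten bad bunkerings balanced by one exceptional one" case described above; they are not taken from the DNVPS data.

```python
from statistics import mean, median

# Hypothetical density differences (kg/m3): ten short-lifted bunkerings
# and one extreme long-lifted one (or a typing error).
dd_samples = [2.0] * 10 + [-20.0]

dd_mean = mean(dd_samples)      # 0.0: the single outlier cancels ten bad bunkerings
dd_median = median(dd_samples)  # 2.0: still reflects the typical bunkering

print(dd_mean, dd_median)
```

Neither number, however, says anything about the shape of the distribution, which is the motivation for looking at histograms and scatter plots.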

Table 1: Standard descriptive measures of density differences for some selected geo-regions and suppliers
(n = number of samples, $\overline{dd}$ = mean density difference, σdd = standard deviation of dd). Histograms for the
geo-regions and suppliers are shown in Figures 1 and 2 respectively, whereas their scatter plots are shown
in Figures 3 and 4. Data in this table and in the following examples is, unless otherwise stated, based on
DNVPS bunkering samples of RMG380 fuel collected in 2008 (confer DNV 2010).
                             n    mean dd   median(dd)   σdd
                                  (kg/m3)   (kg/m3)      (kg/m3)
Global                   43343     0.39      0.10        3.92
Canada & US West Coast    1919     0.03     -0.10        2.43
South Asia                6806     1.22      0.90        3.35
Middle East               2990     1.83      0.70        4.76
South America West         565    -0.48     -0.90        6.00
Supplier 1 (S1)            129    -0.12     -0.10        0.95
Supplier 2 (S2)            239     2.31      0.90        4.84
Supplier 3 (S3)             71     2.40      2.60        1.83
Supplier 4 (S4)            145     2.07      1.50        2.81

Histograms
For a more detailed understanding of the properties of the data in Table 1 please refer to the density
difference histograms of Figures 1 and 2. For comparison we have plotted a smoothed version of the
global histogram (dashed line) and a smoothed version of the actual histogram (solid line). These
histograms represent estimates for the underlying probability density distribution and can thus tell us
something about the risk and possible amount of the short-lifting. A comparison with a reference
histogram, like the global histogram, would provide the desired benchmark.
From Figure 1 it can be seen that none of the histograms seem to come from a normal distribution (the
implications of this observation will not be further discussed in this paper). This can be confirmed by
means of a probability plot. The different geo-regions also show significant differences in their density
reporting practice. Canada & US West Coast appears better than the global average: its histogram
peaks at 0 and has shorter tails. For South Asia, the width of the histogram is similar to
the global one, but its center is shifted towards short-lifting, whereas the Middle East shows a fairly
heavy short-lifting tail. The histogram for South America West is especially remarkable as the chance
of actually getting the fuel density stated by the supplier appears to be slim. The rule is rather that the
buyer is either short- or long-lifted, something which could not be deduced from the standard
descriptive statistics.

Figure 1: Probability distribution of density reporting deviations (i.e. the difference between claimed and
measured density) for 4 selected geo-regions. The histograms are (clockwise from top left): Canada & US
West Coast, South Asia, Middle East, and South America West Coast. The solid lines represent the
smoothed histogram while the dashed lines are the smoothed global histogram. The underlying number of
samples, averages, medians, and standard deviations are given in Table 1. The histograms reveal
considerable variation in density reporting.

Histograms for individual suppliers listed in Table 1 are shown in Figure 2 below. A visual
comparison indicates that Supplier 1 is much better than the global average with a narrow symmetric
distribution centered at 0. The three other suppliers are all heavily short-lifting with varying degrees of
right-shifted and/or right-heavy distributions. Based on these histograms the suppliers might be
characterized as rather bad, but any fine grained information about their underlying reporting strategy
is removed by the histogram. A main disadvantage of using histograms for characterizing suppliers is
that they require a considerable amount of data which could be a challenge when considering short
time periods or suppliers with few data samples.

Figure 2: Probability distribution of density reporting deviations (i.e. the difference between claimed and
measured density) for 4 selected suppliers in 4 different bunker ports (for more details see Table 1). The
histograms reveal different reporting behavior, but histograms become noisy when the number of samples
becomes too low.

Scatter plots
Scatter plots of measured vs. claimed density allow a much more fine grained view of the underlying
data. These plots may be used to unravel the various reporting strategies of the suppliers, see Figure 3
and Figure 4. Scatter plots quite effectively visualize the density reporting behavior of suppliers or
groups of suppliers. Note that each dot in a scatter-plot represents at least one bunkering sample. The
diagonal solid line represents correct density reporting (i.e. stated = measured, in the following called
no-cheat line). The horizontal and vertical dashed lines specify the upper density limit given by the
ISO8217 standard.
These scatter plots reveal some interesting features. Note that the range of densities of the
available fuel varies between geo-regions; e.g. the fuel density range is much wider in the Middle East
than in North America or South Asia. This phenomenon may be traced back to the proximity to crude
oil production in the regions.
Observe also that in many bunkerings the fuel density was above the limit (dots to the right of vertical
dashed line) but almost none of them were reported to lie above the limit (above horizontal dashed
line). This is true for all suppliers.
From Figure 4 we may deduce that Supplier 1 could be considered as rather good, since most of his
samples are on or close to the no-cheat line. This behavior seems to be dominant for most of the
suppliers in the Canada & US West geo-region (note: good suppliers are found in all geo-regions). In
contrast, Supplier 2 may be regarded as bad, since his stated densities cover the whole range from the
no-cheat line and all the way up to maximum-cheating, i.e. the upper density limit given by the
standard. This type of behavior is also visible both in the South Asia and the Middle East scatter plots.
It seems that Supplier 3 has a strategy of simply adding an offset to the real density, which is reflected
in a mean density difference clearly different from zero and a relatively low standard deviation. A
fourth reporting scheme appears with Supplier 4, who has a tendency to always state a density near the
limit, independently of the actual density. This could be termed the worst behavior, since such suppliers
short-lift as much as possible. This behavior is not uncommon in South Asia and the Middle East.
Variations of this scheme, i.e. stating a fixed fuel density but lower than the limit, are seen in Asia,
the Middle East and South America West. They appear as horizontal lines in the scatter plot.

Figure 3: Scatter plot of measured vs. claimed density for the same geo-regions as in Table 1 and Figure 1.
Each black dot represents (at least) one bunkering. The solid line represents the no-cheat line, i.e.
bunkerings where the supplier states the density correctly (claimed = measured), whereas the dashed lines
indicate the upper density limit in the ISO standard for bunker fuel (ISO8217), viz. 991 kg/m3, implicitly
giving the maximum possible amount of cheating. Many dots along the upper dashed line indicate a high
degree of cheating in many bunkerings. Note that in many bunkerings the fuel density was above the limit
(dots to the right of vertical dashed line) but almost none of them were reported to lie above the limit
(above horizontal dashed line).

Figure 4: Scatter plot of measured versus claimed density for the same suppliers as in Table 1 and Figure
2. Supplier 1 reports quite honestly as his dots are scattered closely along the no-cheat line. In contrast,
Suppliers 2 and 3 have many reportings away from the no-cheat line, but they are not as dishonest as
Supplier 4, who basically reports only one density close to 991 kg/m3 irrespective of the actual fuel density.

3. The Good: Best practice benchmark
The above discussion has emphasized the need for a good benchmark for measuring the goodness in
density reporting, and for distinguishing between various short-lifting and long-lifting strategies.
The scatter plots of Canada & US West Coast and Supplier 1 are examples of good density reporting
behaviors that could be used as best practice references. Our interpretation of good or best practice is
indicated by the grey diagonal area around the no-cheat line in Figure 5. Fair reporting and good
control of the delivered density should result in a small symmetric scatter around the no-cheat line,
and thus a narrow density difference (dd) histogram centered at dd = 0 (like the one for Supplier 1 in
Figure 2).
The goal is to establish a best practice, and then use it as a predefined reference to which bunkerings
may be compared. This best practice benchmark is given by the dd-histogram for a group of selected
good suppliers.

Figure 5: Scatter plot of bunkering data from South Asia. Data points around the diagonal line (no-cheat
line) indicate good or best practice behavior, i.e. fair reporting, with little or no cheating. In the area
above the no-cheat line, customers get short-lifted (pay too much) whereas below the line the supplier loses
money. The more dots there are above the fair line, and the further away from it they are, the less
accurate the density reporting. Bunkerings far below the fair area should be considered suspicious and
may indicate a bribing situation. Reportings in the grey horizontal area (reporting densities close to the
upper density limit) indicate that some suppliers consciously choose a strategy of maximum density
cheating. A close up of the scatter plot near the density limit = 991 kg/m3 reveals that hardly any suppliers
are willing to state that their fuel exceeds the limit even when this is clearly the case.

This best practice histogram shall represent good suppliers and should be based on many data points.
Any outliers, intentional cheating, or other indications of dishonesty should be eliminated to obtain an
unbiased and fair benchmark. The following criteria for deriving the best practice benchmark should
therefore be chosen (there will always be a certain element of subjective judgment in this process, but
the method for deriving the benchmark should as far as possible be transparent, sound, and unbiased):
1) Select some geo-regions where the scatter plots show that data are predominantly found along
the no cheat line.
2) For each selected dataset we:
a. Eliminate extreme outliers, max cheating and near limit lying; only data inside a
predefined area around the no-cheat line is selected (see Figure 6 for details).
b. Eliminate any bias by centering the dd data around dd = 0.
3) The adjusted and selected dd data for all the selected sets are then merged into one large
dataset.
4) Calculate the dd histogram for the dataset.
Figure 7 shows the best practice reference histogram derived from the geo-regions Biscay, Canada &
US East Coast, Canada & US West Coast, US Gulf Coast, and Oceania.
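The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure actually used for Figure 7: the band around the no-cheat line is modelled by a hypothetical parameter `band_fraction` (the accepted share of the maximum possible cheating, 991 - measured), centering is assumed to mean subtracting the mean, the bin width is arbitrary, and the sample values are invented. Smoothing the histogram into the histogram function H is not shown.

```python
from collections import Counter
from statistics import mean

DENSITY_LIMIT = 991.0  # kg/m3, upper density limit quoted in the text


def select_and_center(samples, band_fraction=0.5):
    """Steps 2a and 2b for one geo-region.

    samples: list of (measured, claimed) densities in kg/m3.
    band_fraction: assumed share of the maximum possible cheating
    (DENSITY_LIMIT - measured) still accepted as close to the no-cheat line.
    """
    kept = []
    for measured, claimed in samples:
        dd = claimed - measured
        half_width = band_fraction * (DENSITY_LIMIT - measured)
        if abs(dd) <= half_width:      # inside the symmetric band
            kept.append(dd)
    if not kept:
        return []
    offset = mean(kept)                # assumed centering: subtract the mean
    return [dd - offset for dd in kept]


def best_practice_histogram(regions, bin_width=0.5):
    """Steps 3 and 4: merge all regions, return relative frequency per bin."""
    merged = [dd for region in regions for dd in select_and_center(region)]
    counts = Counter(round(dd / bin_width) for dd in merged)
    return {index * bin_width: count / len(merged)
            for index, count in sorted(counts.items())}


# Hypothetical (measured, claimed) samples for two selected regions.
region_a = [(960.0, 960.5), (970.0, 970.0), (980.0, 990.9), (985.0, 984.5)]
region_b = [(965.0, 966.0), (975.0, 974.0)]
hist = best_practice_histogram([region_a, region_b])
print(hist)
```

In this toy input the sample (980.0, 990.9) is rejected as near-limit reporting, while the remaining five samples each land in their own bin.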

Figure 6: Only bunkering samples between the 2 blue solid lines will be used as basis for deriving the best
practice benchmark histogram. This effectively eliminates max cheating, outliers, and ‘near limit effects’,
i.e. less than complete honesty when selling too heavy fuel. The upper solid line divides the angle between
the no-cheat and max-cheat lines. The lower solid line is the upper line mirrored around the no-cheat line,
such that the allowed density deviations have the same magnitude above and below it.

Figure 7: Best practice dd histogram based on samples from selected geo-regions (Biscay, Canada
& US East and West Coast, US Gulf Coast and Oceania) where max cheating, outliers and near
limit dishonesty have been eliminated. The dashed line is the histogram function H, i.e. a
smoothed version of the histogram indicating the global best practice.

Classification by membership function
Once the best practice histogram is generated, the challenge is to benchmark a supplier, a port, or a
region against it. In principle, this histogram must be compared with the dd histograms for the
suppliers in question and the degree of conformance would then give the desired benchmark.
Unfortunately this is a non-trivial task, and for many of the suppliers only relatively few samples are
available, resulting in noisy histograms. We therefore propose a more elegant approach that is
insensitive to the number of data points and to outliers, and that can even be used for a single bunkering.
The concept of a membership function (Turksen 1991; Terano et al 1987, p. 21), which is widely
applied in fuzzy set theory (Lowen 1996, Self 1990), is used to achieve this benchmarking. A single
number (score) is computed denoting the goodness of a specific bunkering or supplier.
An example will hopefully make this clear. Consider the task of benchmarking people into fast and
slow runners, respectively. One way to do this is to set a threshold T on how fast a person should be
able to run 100 m, and then categorize the people who run slower than the threshold as slow (=0) and
those who run faster than the threshold as fast (=1). This sorting is achieved by a Boolean membership
function B with threshold T for the measured time t on 100 m, i.e. B(T,t). However, it is quite obvious
that this benchmarking will result in a crude oversimplification as there is a continuous transition from
extremely fast runners to the really slow ones, and a small change in the chosen threshold could
seriously alter the number of members in each category. A better approach would be to replace the
Boolean function with a continuous function, assigning a continuous membership value between 0 and
1 depending on how fast they run. This is an example of a so-called membership function, and will in
the following simply be denoted m.
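The runner example can be made concrete with a short sketch. The threshold of 13.5 s and the linear ramp between 11 s and 16 s are hypothetical choices made only for illustration; any continuous, monotone function would serve the same purpose.

```python
def boolean_membership(threshold_s: float, time_s: float) -> int:
    """B(T, t): 1 (fast) if the 100 m time beats the threshold, else 0 (slow)."""
    return 1 if time_s < threshold_s else 0


def fuzzy_membership(time_s: float, fast_s: float = 11.0, slow_s: float = 16.0) -> float:
    """m(t): 1 at or below fast_s, 0 at or above slow_s, linear in between."""
    if time_s <= fast_s:
        return 1.0
    if time_s >= slow_s:
        return 0.0
    return (slow_s - time_s) / (slow_s - fast_s)


# Two runners who differ by only 0.2 s end up in opposite crisp categories,
# while their continuous membership values are almost identical.
for t in (13.4, 13.6):
    print(t, boolean_membership(13.5, t), round(fuzzy_membership(t), 2))
```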
The situation is analogous to our best practice density benchmark where suppliers (or bunkerings) are
not grouped into crisp sets of good and bad but rather get a score indicating how close to or far away
from the best practice they are. This, by the way, is also the reason why e.g. discriminant analysis
(Hastie et al 2009) is unsuitable for the task at hand.
The challenge is to find a membership function for the good group, faithfully reflecting what we
consider to be good. Fuzzy set theory does not provide help in determining the membership function,
as all kinds of functions are used, e.g. triangular, trapezoid, Gaussian, etc. The discussion of good
behavior above gives us some hints about the properties of the desired membership function. It should
not be too wide, as a bad bunkering could then be regarded as good. Likewise, if it is too narrow then a
good bunkering would get a too low goodness score. It is important that the membership function
represents the best practice set as well as possible. The obvious choice is to derive the membership
function directly from the dd histogram itself.
The membership function for good bunkerings, mG, must have a maximum value of 1 at dd = 0, i.e.
mG(dd=0) = 1, and must decrease continuously in both directions, i.e. a rescaling and shift of the H
histogram has to be done. We therefore propose the following definition of the membership function:

$$ m_G(dd) = \frac{H(dd)}{\max(H)} = \frac{H(dd)}{H(0)} $$

where the subscript G indicates that this gives a goodness scoring, and H is the smoothed (and
adjusted) best practice histogram (i.e. H is the histogram function). Note that mG is a function of the
distance of dd to 0, as well as the frequency of dd in the best practice. This membership function can
now be applied e.g. to all n supplier samples to obtain the overall goodness benchmark,

$$ b_G = \frac{1}{n} \sum_{i=1}^{n} m_G(dd_i) $$

where the summation is done over all n bunkerings for a specific supplier, port, or geo-region.
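A minimal sketch of these two definitions is given below. The tabulated values in `H_VALUES` are hypothetical stand-ins for the smoothed best practice histogram (they are not read off Figure 7), and linear interpolation between grid points is an assumption of this sketch.

```python
from bisect import bisect_right

# Hypothetical smoothed best practice histogram H, tabulated on a dd grid (kg/m3).
GRID = [-6.0, -4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0, 6.0]
H_VALUES = [0.0, 0.01, 0.06, 0.15, 0.25, 0.15, 0.06, 0.01, 0.0]


def H(dd: float) -> float:
    """Histogram function: linear interpolation on the grid, zero outside it."""
    if dd <= GRID[0] or dd >= GRID[-1]:
        return 0.0
    j = bisect_right(GRID, dd)          # GRID[j - 1] <= dd < GRID[j]
    x0, x1 = GRID[j - 1], GRID[j]
    y0, y1 = H_VALUES[j - 1], H_VALUES[j]
    return y0 + (y1 - y0) * (dd - x0) / (x1 - x0)


def m_G(dd: float) -> float:
    """Goodness membership: H rescaled so that m_G(0) = 1."""
    return H(dd) / H(0.0)


def b_G(dd_samples) -> float:
    """Goodness benchmark: mean membership over all samples."""
    return sum(m_G(dd) for dd in dd_samples) / len(dd_samples)


print(b_G([0.0, 1.0, -1.0, 6.0]))  # score for four bunkerings
print(b_G([0.0]))                  # the score is also defined for a single bunkering
```

Because the score is a plain average of per-sample memberships, it needs no minimum sample size, and an extreme outlier can lower it by at most 1/n.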
An interesting observation is that the score from the membership function mG(dd) is not (a priori) a
probabilistic measure; it is a measure between 0 and 1 based on how far a variable is from a reference
value, here dd = 0 (see Figure 8). However, the rescaling does preserve a probabilistic interpretation:
mG(dd) is the probability of finding a value x in a small interval around dd, relative to that of finding
a value y in an equally sized interval around 0, given that the samples are drawn from the best practice
group.
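In symbols, this relative probability can be written as below. The interval width δ is a hypothetical symbol introduced only for this sketch, and the approximation assumes that H is proportional to the probability density of dd within the best practice group.

```latex
m_G(dd) = \frac{H(dd)}{H(0)}
\approx \frac{P\left(|x - dd| < \delta/2\right)}{P\left(|y| < \delta/2\right)},
\qquad x, y \ \text{drawn from the best practice group}, \quad \delta \to 0 .
```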

Figure 8: The solid line gives the goodness membership function, mG, which is a scaling of the best practice
histogram. mB = 1-mG gives the membership function for the opposite (dashed line), i.e. bad which in turn
could be divided into a long- and short-lifting part, mLL and mSL respectively (corresponding to negative
and positive dd values). E.g. a bunkering with dd=2.3 would get a good score of mG=0.23 and a bad score of
mB=0.77 (with mLL=0 and mSL=0.77).

The Bad
Note that mG(dd) was derived based on what was chosen to be the best practice. It therefore gives a
measure/score for how good a bunkering or supplier is with respect to this best practice. The
complement,
mB(dd) = 1 - mG(dd),
gives a badness score, but it does not tell whether the badness comes from short- or long-lifting.
Fortunately, mB can, depending on whether a sample falls into the short- or long-lifting domain, be
further divided into mSL and mLL. That is, if the dd value of a sample is positive, its mSL will be greater
than zero; if the dd value of a sample is negative, its mLL will be greater than zero.
This enables us to calculate short- and long-lifting scores similar to the goodness score:
$$ b_{xL} = \frac{1}{n} \sum_{i=1}^{n} m_{xL}(dd_i), $$

where the subscript xL should be SL or LL, which stands for short- or long-lifting, respectively. These
scores indicate the behavior of a supplier and give the risk of being short- or long-lifted. Note, by
definition:
bG + bSL + bLL = 1
Remember that the scores correspond to the degree of membership, i.e. how close a bunkering is to the
good or bad benchmark; they can therefore be understood as weights corresponding to the proportion
of good or bad.
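The split of the badness score and the identity bG + bSL + bLL = 1 can be checked with a small sketch. The triangular `m_G` used here is only a stand-in (in the paper mG is derived from the best practice histogram), and the sample values are hypothetical.

```python
def m_G(dd: float, half_width: float = 3.0) -> float:
    """Stand-in goodness membership: triangular, equal to 1 at dd = 0 (assumption)."""
    return max(0.0, 1.0 - abs(dd) / half_width)


def m_SL(dd: float) -> float:
    """Short-lifting membership: the badness 1 - m_G for positive dd, else 0."""
    return 1.0 - m_G(dd) if dd > 0 else 0.0


def m_LL(dd: float) -> float:
    """Long-lifting membership: the badness 1 - m_G for negative dd, else 0."""
    return 1.0 - m_G(dd) if dd < 0 else 0.0


def scores(dd_samples):
    """Return (b_G, b_SL, b_LL) as averages of the per-sample memberships."""
    n = len(dd_samples)
    b_g = sum(m_G(dd) for dd in dd_samples) / n
    b_sl = sum(m_SL(dd) for dd in dd_samples) / n
    b_ll = sum(m_LL(dd) for dd in dd_samples) / n
    return b_g, b_sl, b_ll


b_g, b_sl, b_ll = scores([0.0, 1.5, -0.6, 4.0])  # hypothetical dd values in kg/m3
print(b_g, b_sl, b_ll, b_g + b_sl + b_ll)
```

The identity holds for every sample individually, since each dd contributes mG to the good score and 1 - mG to exactly one of the two bad scores (or to neither when dd = 0, where mG = 1).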

The Ugly
As pointed out above, profit maximization by reporting densities at or close to the upper limit may be
considered as fairly ugly behavior. The same methodology can be applied to obtain a near limit score
for this behavior by constructing a membership function
mNC(claimed density) = mG(claimed density - 991)
where the subscript NC denotes Near Ceiling.
This membership function assigns a score to a bunkering according to the distance of the claimed density
from the density limit and the frequency of occurrence of that distance in the benchmark. To avoid
categorizing a bunkering as ugly when the measured density is actually near the limit, we multiply mNC
by mSL. In so doing we exclude all reportings that are near the limit but that are actually honest. We
propose the following ugly or near limit benchmark

$$ b_{NC} = \frac{1}{n} \sum_{i=1}^{n} m_{SL}(dd_i) \cdot m_{NC}(\text{claimed density}_i) $$

giving the fraction of short-lifting that could be considered as near limit reporting.
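The effect of combining the two memberships can be seen in the following sketch. As before, the triangular `m_G` is a stand-in for the histogram-derived function, and the three (measured, claimed) samples are hypothetical.

```python
DENSITY_LIMIT = 991.0  # kg/m3, upper density limit quoted in the text


def m_G(dd: float, half_width: float = 3.0) -> float:
    """Stand-in goodness membership: triangular, equal to 1 at dd = 0 (assumption)."""
    return max(0.0, 1.0 - abs(dd) / half_width)


def m_SL(dd: float) -> float:
    """Short-lifting membership: the badness 1 - m_G for positive dd, else 0."""
    return 1.0 - m_G(dd) if dd > 0 else 0.0


def m_NC(claimed: float) -> float:
    """Near-ceiling membership: m_G applied to the distance from the limit."""
    return m_G(claimed - DENSITY_LIMIT)


def b_NC(samples) -> float:
    """Near limit benchmark over (measured, claimed) pairs."""
    return sum(m_SL(claimed - measured) * m_NC(claimed)
               for measured, claimed in samples) / len(samples)


samples = [
    (990.5, 991.0),  # heavy fuel reported near the limit: almost honest
    (970.0, 991.0),  # light fuel reported at the limit: maximum cheating
    (960.0, 960.0),  # honest report far from the limit
]
for measured, claimed in samples:
    print(measured, claimed, round(m_SL(claimed - measured) * m_NC(claimed), 3))
print(round(b_NC(samples), 3))
```

Only the second sample contributes fully; the first is close to the ceiling but receives a small weight because its density difference is small.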

Further characterization of Good and Bad
In order to further characterize bunkering samples within the good, short-lifting, or long-lifting region
of the scatter plot, the average density deviation in each region can be computed by weighting each
bunkering sample with the corresponding score from the membership function. For instance, the mean
density difference $\overline{dd}_{SL}$ in the short-lifting area is:

$$ \overline{dd}_{SL} = \frac{\sum_i dd_i \cdot m_{SL}(dd_i)}{\sum_i m_{SL}(dd_i)} $$

in kg/m3, where the index i runs over all n samples.
This means that, for a given supplier, we can provide information about the risk of being short-lifted,
bSL, and about the expected average amount in density difference, $\overline{dd}_{SL}$. The method is
easily extended to the other identified behaviors.
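The weighted mean can be sketched as follows, again with a triangular stand-in for mG and hypothetical dd values. Returning `None` when no sample has any short-lifting weight is a design choice of this sketch, since the ratio is undefined in that case.

```python
def m_G(dd: float, half_width: float = 3.0) -> float:
    """Stand-in goodness membership: triangular, equal to 1 at dd = 0 (assumption)."""
    return max(0.0, 1.0 - abs(dd) / half_width)


def m_SL(dd: float) -> float:
    """Short-lifting membership: the badness 1 - m_G for positive dd, else 0."""
    return 1.0 - m_G(dd) if dd > 0 else 0.0


def mean_dd_SL(dd_samples):
    """Membership-weighted mean density difference in the short-lifting region."""
    weights = [m_SL(dd) for dd in dd_samples]
    total = sum(weights)
    if total == 0.0:
        return None  # no short-lifting weight at all: the mean is undefined
    return sum(dd * w for dd, w in zip(dd_samples, weights)) / total


dd_samples = [0.0, 1.5, -0.6, 4.0]   # hypothetical dd values in kg/m3
print(mean_dd_SL(dd_samples))        # dominated by the clearly short-lifted sample
print(mean_dd_SL([0.0, -1.0]))       # None: nothing is short-lifted
```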

4. Application of the benchmarks
As discussed above the power of the scatter plot lies in the visualization of the different density
reporting schemes. Several patterns, like fixed value density reporting, systematic density reporting
deviations, etc., are easily spotted. The benchmarks developed above are constructed to discriminate
between some of these different reporting schemes, and to quantify the risk of being short-lifted as
well as the amount of short-lifting that should be expected. The benchmarks for our examples from
Table 1 are given in Table 2 below.

Table 2: Standard descriptive measures together with our benchmark(s) for the geo-regions and suppliers
from Table 1. The benchmarks for the data that were used to generate the best practice histogram are also
included for comparison. A row, e.g. Global, is read as follows: average density difference is 0.39, std=3.92.
Benchmarking against the best practice gives the following results: 43% of the samples can be regarded as
good (bG), 31% qualify as short-lifting (bSL), and 26% as long-lifting (bLL). For the short-lifting samples the
average density difference is 3.31, but only 7% of them were near the ceiling.
Geo-region / supplier     mean dd    σdd       bG     bSL    bLL    bNC    mean ddSL
                          (kg/m3)    (kg/m3)                               (kg/m3)
Best Practice             0.05       1.16      0.62   0.19   0.19   0.01   1.50
Global                    0.39       3.92      0.43   0.31   0.26   0.07   3.31
Canada & US West Coast    0.03       2.43      0.55   0.22   0.24   0.02   2.09
South Asia                1.22       3.35      0.41   0.52   0.07   0.26   2.44
Middle East               1.83       4.76      0.32   0.49   0.19   0.02   4.61
South America West        0.48       6.00      0.08   0.42   0.50   0.00   3.73
Supplier 1                0.12       0.95      0.71   0.09   0.20   0.02   1.70
Supplier 2                2.31       4.84      0.36   0.53   0.11   0.13   4.65
Supplier 3                2.40       1.83      0.09   0.87   0.03   0.00   2.81
Supplier 4                2.07       2.81      0.27   0.72   0.01   0.46   2.64

The samples used to generate the best practice histogram were included in the table for easy
comparison. Note that the good score can only be 1 when all samples are at dd=0; this
explains why even the good score of the best practice is ‘only’ 0.62. The table shows that for the
selected geo-regions the highest risk of being short-lifted is found in South Asia. The near-limit
benchmark, bNC, confirms what is apparent from the scatter-plot (Figure 3), that for many suppliers it
is a common practice to maximize their profit by just reporting a fuel density at or near the limit.
South America West nicely illustrates the strong ability of the benchmark to identify the underlying
behavior. Recall that for this area the mean was near zero, but the high standard deviation suggested
large fluctuations in their reporting. Even so, no indications about the underlying reporting schemes,
or the risk of being short- or long-lifted, can be deduced. In contrast, our benchmark reveals that the
likelihood of actually getting what you paid for is rather slim, viz. around 8%. In the vast majority of
the cases either short- or long-lifting takes place.
Observe also that Supplier 1 can indeed be regarded as honest, with a good score higher than best
practice. Suppliers 2 and 3 have comparable average density differences, but their good and near-limit
benchmarks clearly separate them. A comparison of the benchmarks with the corresponding scatter
plots will confirm that the benchmarks do indeed give a more accurate description of the honesty of
suppliers than standard descriptive statistics.
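
The following sketch illustrates why a mean near zero can hide poor reporting. It assumes, by analogy with the bNC formula, that bG, bSL and bLL are the sample averages of the corresponding membership values, and it uses a hypothetical bell-shaped m_good in place of the function derived from the best practice histogram. The two sample sets are made up.

```python
import math
from statistics import mean

def m_good(dd, width=1.34):
    """Hypothetical 'good' membership: bell curve with m_good(0) == 1."""
    return math.exp(-dd ** 2 / (2.0 * width ** 2))

def scores(dds):
    """Return (b_G, b_SL, b_LL), assumed to be averages of membership values."""
    n = len(dds)
    b_good = sum(m_good(dd) for dd in dds) / n
    # Bad membership is 1 - m_good; it counts as short-lifting for dd > 0
    # and as long-lifting for dd < 0.
    b_short = sum(1.0 - m_good(dd) for dd in dds if dd > 0) / n
    b_long = sum(1.0 - m_good(dd) for dd in dds if dd < 0) / n
    return b_good, b_short, b_long

# Two made-up suppliers whose mean density difference is (close to) zero.
erratic = [6.0, -6.0, 5.0, -5.0]
steady = [0.2, -0.1, 0.0, -0.1]

print(mean(erratic), scores(erratic))
print(mean(steady), scores(steady))
```

Both sample sets have essentially the same mean, yet the erratic one scores close to zero on bG and splits almost evenly between short- and long-lifting, which is the pattern seen for South America West in Table 2.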

Figure 9: Comparison of different benchmarking methods: suppliers ranked based on their mean density
difference, dd , (top), and their corresponding good score, bG (bottom). Observe that ranking with respect
to the mean would result in about 1057 good suppliers (| dd | ≤ 0.7). Our scoring with respect to best
practice (0.62) reveals, however, that about 150 of them are definitely bad (left-hatched area), even below
the global average (0.43). 539 are really good (equal to or better than best practice, right-hatched area),
whereas the rest are located between global average and best practice. Observe also that simply relying
on the mean to characterize suppliers would label several of them as bad even though their good score is
above best practice.

Supplier ranking
In Figure 9 (top) all suppliers of RMG380 fuel worldwide are ranked with respect to their mean
density difference, dd . When using | dd | ≤ 0.7 as a criterion for goodness then the mean would imply
there are about 1057 good suppliers. Applying this mean dd to our benchmarking method results in
the continuous bell-shaped curve (blue). If dd is indeed an unbiased measure for the goodness of
suppliers, then their scorings should be closely scattered around this curve – this is, however, not at all
the case. This discrepancy stems from the unreliability of the mean (or standard deviation) as a
trustworthy measure whenever the underlying distributions are non-normal or outliers have a large
effect. The figure clearly shows that 150 of the apparently good suppliers are actually quite bad, i.e.
even below global average (left hatched area), whereas only about half (539) can be considered
equal to or better than best practice (right hatched area). Observe also that many of the apparently bad
suppliers (those with | dd | > 0.7) are actually better than their reputation, as most of them are above the
bell-shaped curve and some are even above best practice, further emphasizing the need for an unbiased
score like bG.
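
The two ways of labelling a supplier can be compared with a small helper. This is only a sketch: the thresholds are the ones quoted in the text (| dd | ≤ 0.7, global average 0.43, best practice 0.62), the function name and label strings are made up for the example, and the input pairs are taken from Table 2.

```python
MEAN_LIMIT = 0.7        # criterion on |mean dd| from the text, kg/m3
GLOBAL_AVERAGE = 0.43   # global good score (Table 2)
BEST_PRACTICE = 0.62    # best practice good score (Table 2)

def classify(mean_dd, b_good):
    """Label by the mean criterion and, separately, by the good score."""
    by_mean = "good" if abs(mean_dd) <= MEAN_LIMIT else "bad"
    if b_good >= BEST_PRACTICE:
        by_score = "really good"
    elif b_good < GLOBAL_AVERAGE:
        by_score = "bad"
    else:
        by_score = "in between"
    return by_mean, by_score

# (mean dd, bG) pairs from Table 2.
rows = {
    "Supplier 1": (0.12, 0.71),
    "Supplier 3": (2.40, 0.09),
    "South America West": (0.48, 0.08),
}
for name, (mean_dd, b_good) in rows.items():
    print(name, classify(mean_dd, b_good))
```

South America West is the instructive case: it passes the mean criterion but falls well below the global average on the good score.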

Development over time
Following the development of the score of a supplier, port, or region over time may give valuable
indications about what may be expected in the near future. For instance, Figure 10 shows the
development of the bG score for two major ports, Singapore and Rotterdam, over the past 25 years.

Figure 10: Time series of goodness scores bG for two large ports in different geo-regions. Data from all
available suppliers are included. Dots are quarterly time intervals while the stippled lines are year
averages. Each dot is based on a varying number of ‘raw data points’, i.e. the number of bunkerings
during the corresponding time interval.

Observe that from the beginning of the 1980s and up to the mid 1990s the quality of the density
reporting was increasing. It then leveled off until 2008, when a change in behavior occurred – perhaps
triggered by the onset of the global recession?

5. Discussion and concluding remarks
This paper has two main focus areas: the construction of a realistic benchmark and the development of
a methodology that allows comparing one or more samples with the benchmark.
The examples given above demonstrate the capabilities of our approach. It is more powerful than
standard descriptive statistics (e.g. dd and σdd), as it is less sensitive to outliers and is well suited for
small datasets and even single samples. Recall that our benchmarks give a better quantification than
dd and σdd together. Further, the approach makes no assumptions about the probability distribution
of the underlying data; any distribution is allowed. Only some weak requirements apply to the
membership function (e.g. that it is increasing or decreasing).
The methodology is quite generic and could in principle be applied to any kind of comparison task, i.e.
benchmarking.
The fact that the benchmark is based on a probability density function, and that a probabilistic
interpretation of the scoring is possible, is an aid to the user’s intuition, making it easier to understand
and interpret the results.
Once a best practice histogram has been generated, a membership function can be derived, after which
benchmarking is easily done. Subjectivity is only involved in the definition of what can be regarded as
best practice, as there is no a priori correct answer to this problem. Our approach has been to ask:
what should be expected of a good supplier? And by answering this question we have picked suppliers
that best match our expectations. Outliers and incorrect claims near the density limit are of course not
wanted from a good supplier, hence their removal from the best practice data set.
From a user perspective the main strengths of the presented benchmark are:
• Intuitive and easy to understand.
• Applicable for few or even singleton samples.
• Able to pinpoint different density reporting schemes.
In closing let us return to the extent and amount of global short-lifting which is estimated to be around
1.7 ton per bunkering on average. Thanks to our benchmarking methodology we can now provide a
more detailed picture of the situation. First, 43% of the bunkerings could be considered to be loss
neutral (bG=0.43), since they are within best practice. Second, 26% are instances of long-lifting
(bLL=0.26), where the buyer gains on average 1.8 ton. Third, 31% could be regarded as short-lifting
(bSL=0.31), with an average buyer loss of 2.5 ton per bunkering. This highlights the importance of
choosing the right supplier.
The presented benchmark methodology is easily extendable to other (quality and economical)
bunkering parameters like viscosity, sulfur or water content, as well as a series of physical and
chemical properties. The methodology will be the basis for a benchmarking web tool, scheduled for
release by DNVPS later this year.

Figure 11: Bunker surveyor on board a ship. Photo by DNV Petroleum Services (used with
permission).

References

Bhattacharyya, G., Johnson, R. (1977), Statistical Concepts and Methods, Wiley, New York.
DNV (2010). Total fuel management,
http://www.dnv.com/industry/maritime/servicessolutions/fueltesting (accessed 13. Oct. 2010).
EPA (2008), Global Trade and Fuels Assessment -Future Trends and Effects of Requiring Clean Fuels
in the Marine Sector. Assessment and Standards Division Office of Transportation and Air Quality,
U.S. Environmental Protection Agency. EPA420-R-08-021, November 2008.
Eyring, V., Isaksen, I.S.A., Berntsen, T., Collins, W.J., Corbett, J.J., Endresen, O., Grainger, R.G.,
Moldanova, J., Schlager, H., Stevenson, D.S. (2010), “Transport impacts on atmosphere and climate:
Shipping”, Atmospheric Environment, Volume 44, Issue 37, December 2010, pp. 4735-4771.
Hastie, T., Tibshirani, R., Friedman, J. (2009), The Elements of Statistical Learning: Data Mining,
Inference, and Prediction (second edition). Springer, New York.
IEA (2010). World Energy Outlook 2010. International Energy Agency, OECD Publishing, Paris.
IMO (2009). Prevention of Air Pollution from Ships. International Maritime Organization, Marine
Environment Protection Committee. MEPC 59/INF.10, 9 April 2009.
Lowen, R. (1996), Fuzzy Set Theory, Kluwer Academic Publishers, Dordrecht.
Self, K. (1990), “Designing with fuzzy logic”, IEEE Spectrum, Vol 27, No 11, November 1990, pp.
42-44, p. 105.
Terano, T., Asai, K., Sugeno, M. (1987), Fuzzy Systems Theory and its Applications. Academic Press,
San Diego.
Turksen, I.B. (1991), “Measurement of membership functions and their acquisition”, Fuzzy Sets and
Systems, Vol. 40, pp. 5-38.

[Figure 5: scatter plot of claimed density versus measured density (both axes 961 to 991), with regions marked Good, Suspicious and Bad, the density limit on each axis, and the max. cheat area.]

[Figure 6: diagram showing the no cheat line and the limit (= max. cheat line).]

[Figure 7: probability versus density deviations.]

[Figure 8: membership functions mG and mB = 1 - mG over the density difference, with long-lifting to the left and short-lifting to the right of 0. Example at dd = 2.3: Good mG = 0.23, Bad mB = 1 - 0.23 = 0.77.]

[Figure 9, top panel: claimed minus measured density versus total number of suppliers (0 to 2500); ca. 1057 suppliers lie between -0.7 and 0.7. Bottom panel: good score (0 to 1) versus total number of suppliers, with the best practice score and the global average score marked. Annotations: 539; 150; "Some 'bad suppliers' are actually very good!"; "Some 'bad suppliers' are actually slightly better!"; "Many 'good suppliers' are actually quite bad!"]
