Software bill of materials: tools and analysis of open source projects of the public administration

University of Trieste
Department of Engineering and Architecture
Computer & Electronic Engineering
Master’s Thesis
Software bill of materials: tools and analysis
of open source projects of the public
administration
Candidate:
Federico Boni
Supervisor:
Prof. Alberto Bartoli
Academic Year 2021–2022

Abstract
During the life cycle of a software product, many elements are involved in
its development and distribution. Examples are the development tools or the
third-party libraries that are used. Together, all these elements define the
software supply chain.
When developing software, modularity and code reuse are effective practices
to avoid tackling problems that have already been solved. These practices
lead to the use of external components called dependencies, often open source
(software libraries). However, it is important to note that these components
are subject to vulnerabilities, and the presence of these vulnerabilities
translates into potential risks for the software itself. Monitoring the supply
chain is therefore essential for organizations involved in software development.
The recognition of the importance of this problem led to the emergence of
SBoMs. A Software Bill of Materials, or SBoM, is a formal document that keeps
track of each of the components used within a software artifact, i.e. the
elements of the supply chain.
This study uses existing technologies to collect dependencies and to create
SBoM files for a set of software artifacts. The supply chain components taken
into account are libraries from different programming language ecosystems.
The creation of the SBoMs is followed by a vulnerability analysis, to identify
potential vulnerabilities of the software caused by elements of the supply
chain. The software artifacts considered are open source projects of government
agencies of 4 countries: 4 datasets are defined, containing GitHub repositories
of Italy, Germany, the United Kingdom and the United States.
The results reveal some differences among the datasets, both in terms of
dependencies on software packages and in terms of vulnerabilities. Many
repositories depend on third-party packages; moreover, a small number of
packages exhibit a high number of vulnerabilities. In addition, some vulnerable
packages are widely used among the repositories of the datasets, making the
latter potentially vulnerable. During the creation and analysis of the SBoMs,
some critical issues also emerge in the handling of certain dependencies and
in the compatibility between the construction and analysis tools used.
Finally, after analysing the results and the limitations of the tools used,
it is observed that SBoM standards can be an effective way to keep track of
the components of a software artifact and to systematically analyse the
presence of vulnerabilities. However, it is not yet possible to define a
procedure for creating and analysing SBoMs for a GitHub repository that is:
accurate in collecting dependencies, consistent in handling different
ecosystems and complete in collecting vulnerabilities.

Contents
1 Introduction
2 Supply-Chain Security in Open Source Software
2.1 Introduction to the problem
2.1.1 Software ecosystems, dependencies and packages
2.1.2 Software supply chain
2.1.3 Open source and supply chain
2.2 SBoM
2.2.1 Software Bill Of Materials
2.2.2 Standards
2.2.3 Tools
2.3 Review of the Literature
2.3.1 Top Five Challenges in Software Supply Chain Security: Observations From 30 Industry and Government Organizations
2.3.2 An Empirical Comparison of Dependency Network Evolution in Seven Software Packaging Ecosystems
2.3.3 Structure and Evolution of Package Dependency Networks
3 GitHub data collection
3.1 GitHub and Government
3.1.1 Organizations characteristics per country
3.1.2 Repositories characteristics per country
3.2 The case study: 4 datasets for Italy, Germany, the United States and the United Kingdom
4 Methods and Analyses
4.1 SBoM creation and vulnerability detection
4.2 Database
4.3 Creation of SBoMs: critical issues
4.3.1 Conditional dependencies
4.3.2 Version constraints
4.4 Vulnerabilities collection: critical issues
4.4.1 Grype and Java vulnerabilities
4.4.2 Grype false positive
4.5 Dependency analyses
4.5.1 Packages and dependencies: Viewpoints of package managers, datasets and repositories
4.5.2 Critical dependencies
4.5.3 Manifest vs Parsed dependencies
4.6 Vulnerability analyses
4.6.1 Viewpoint of repositories
4.6.2 Viewpoint of packages
5 Discussion and conclusions
6 Appendix
6.1 GitHub & Government list
6.2 Database connector
6.3 Database
6.4 Organization data (GitHub APIs)
6.5 Repository data (GitHub APIs)
6.6 Contributor data (GitHub APIs)
6.7 SBoM (sbom-tool)
6.8 Parsed dependencies (Python)
6.9 Parsed dependencies (JavaScript)
6.10 Vulnerabilities data (Grype)
6.11 Vulnerabilities data (OSV API)
6.12 Figures (Bar charts)
6.13 Figures (Line charts)
6.14 Figures (Venn diagrams)
6.15 Figures (Pie charts)
6.16 Figures (Heat maps)

1 Introduction
The supply chain of a software artifact is the set of elements of any type
that play a role in its life cycle. Cybersecurity attacks related to a software
supply chain can be either attacks on third-party supply chain elements (known
as supply chain attacks) or attacks that exploit vulnerabilities in third-party
components.
During software development, modularity and code reuse are effective practices
to avoid dealing with problems that have already been solved. Nowadays software
is rarely created from scratch; it is instead a combination of various external
components called dependencies, often open source (software libraries).
However, these components are subject to vulnerabilities, and the developers
who use them do not have full control over their code. It is important to note
that the presence of vulnerabilities in the supply chain translates into
potential risks for the software itself. Monitoring the elements of the supply
chain is therefore essential for organizations involved in software development.
This work considers the concept of SBoM (Software Bill of Materials), a formal
record containing the list of components of a software artifact, i.e. the
elements of its supply chain. The concept of SBoM gained wider attention when,
on 12 May 2021, the President of the United States signed an Executive Order [1]
with standards and best practices concerning the cybersecurity of the United
States. That Executive Order recommended the use of SBoMs to track the supply
chain relationships of software applications.
SBoM standards are specifically designed to be machine readable. This makes
it possible to track dependencies programmatically and continuously, starting
from a SBoM. Moreover, the strength of a machine-readable SBoM format lies in
the possibility of defining formal procedures for analysing it: software
vendors can use SBoMs as input for vulnerability analysis tools capable of
matching each third-party component described in the SBoM with existing
vulnerabilities.
Here, a special focus is given to the dependencies that come from various
programming language ecosystems. These dependencies are third-party libraries,
modules or packages that can be installed using the package managers of each
language. Potential cybersecurity issues caused by vulnerabilities of these
dependencies are also taken into account. In detail, this work attempts to
build and analyse SBoMs starting from a set of GitHub repositories belonging
to the public administrations of different countries. The countries considered
are Italy, Germany, the United Kingdom and the United States.
This work reports the results obtained in terms of dependency and vulnerability
analysis of the GitHub repositories under consideration, as well as the
problems and critical issues related to the tools used.

2 Supply-Chain Security in Open Source Software
2.1 Introduction to the problem
2.1.1 Software ecosystems, dependencies and packages
The work in [2] defines a software ecosystem as: “A collection of software
projects, which are developed and co-evolve in the same environment”.
An environment can be defined by the components of an organization, by a
community (e.g. open source communities) or by a programming language or
framework.
In this work a software ecosystem is intended as an ecosystem defined by a
programming language and its package managers.
Given a software artifact, a dependency is any other artifact (libraries,
plugins, ...) that the software artifact requires in order to work as expected.
Dependencies may either be parts of the software itself (i.e. developed
by the same programmers) or be third-party components. In the latter case,
a distinction is made between proprietary dependencies and open source
dependencies. The set of dependencies of a software artifact defines the
so-called dependency tree. Dependencies are usually divided into direct and
transitive dependencies:
• Direct dependency: a dependency d of a software artifact x is a direct
dependency when x refers to d directly.
• Transitive dependency: a dependency d1 of a software artifact x is a
transitive dependency when d2 is a dependency of x and d1 is a dependency
of d2.
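The two definitions above can be illustrated with a minimal Python sketch. The graph below is invented for illustration and reuses the names x, d1 and d2 from the definitions; the function names are hypothetical and not part of any tool used in this work. As a choice of this sketch, a package that is referenced directly and is also reachable through another dependency is reported only as direct.

```python
from collections import deque

# Hypothetical dependency graph: each artifact maps to the artifacts
# it refers to directly. The names are illustrative only.
GRAPH = {
    "x": ["d2", "d3"],
    "d2": ["d1"],
    "d3": [],
    "d1": [],
}


def direct_dependencies(graph, artifact):
    """Dependencies that `artifact` refers to directly."""
    return set(graph.get(artifact, []))


def transitive_dependencies(graph, artifact):
    """Dependencies reached only through other dependencies of `artifact`."""
    direct = direct_dependencies(graph, artifact)
    reached = set()
    queue = deque(direct)
    while queue:
        current = queue.popleft()
        for dep in graph.get(current, []):
            if dep not in reached and dep != artifact:
                reached.add(dep)
                queue.append(dep)
    # Design choice of this sketch: anything also referenced directly
    # is reported as direct, not transitive.
    return reached - direct
```

With this graph, d2 and d3 are direct dependencies of x, while d1 is a transitive dependency because it is reached through d2.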
When developers want to distribute their code (e.g. a library) in a certain
ecosystem, they can distribute a so-called package using a target package
manager. A package can contain a stand-alone tool as well as a library or a
set of libraries. Programming languages use package managers to resolve the
dependencies (packages) defined in the software dependency tree during the
build of a software artifact in its ecosystem.
How dependency information is stored and how a dependency is resolved depends
strictly on the software ecosystem; dependency resolution can occur at compile
time as well as at runtime. The way the authors of a software artifact decide
which dependencies to use and how to manage them may affect the reliability
of the software, which may be compromised by its dependencies.
For example, if a developer uses a package that is no longer present in a given
package manager registry, the software will no longer be installable using that
package manager. This is the case of the npm package left-pad [3], a package
composed of a single function of 11 lines. When its owner, a self-taught
high school graduate programmer from Turkey, removed it from the npm registry
in March 2016, many packages that depended on it could not be installed.
Among them, the most famous were Babel, Atom and React.

2.1.2 Software supply chain
The software supply chain is the set of elements of any type that play a role
in the software development life cycle (SDLC) of a given software artifact.
The dependencies of a software artifact are certainly part of its supply chain,
and software artifacts have more and more dependencies: the authors of [4]
report that, up to 2017, the total number of dependencies of the packages of
the package managers they considered (cargo, cpan, cran, npm, nuget, packagist
and rubygems) kept growing over time.
Also, software developers rely on third-party dependencies (open source or
proprietary) simply to avoid reinventing the wheel, speeding up the development
process. However, this massive use of third-party dependencies increases the
number of elements in software supply chains. The growth of dependencies
in a software artifact leads to the so-called dependency hell, a colloquial
term for various problems related to the growth of dependencies in a project,
including the size and number of dependencies, the presence of conflicts and
the presence of circular dependencies.
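One of the problems listed above, circular dependencies, can be detected mechanically once the dependencies are available as a graph. The following Python sketch uses a depth-first search; the function name and the package names in the usage note are hypothetical, and the recursive formulation is only suitable for small graphs.

```python
def find_cycle(graph):
    """Return one dependency cycle as a list of names, or None if there is none.

    `graph` maps each package to the list of packages it depends on.
    """
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNVISITED for node in graph}
    path = []  # packages on the current depth-first path

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in graph.get(node, []):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == IN_PROGRESS:
                # `dep` is already on the current path: a cycle is closed.
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state[node] == UNVISITED:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

For instance, for a graph in which a depends on b, b on c and c on a, the function returns the path a, b, c, a; for an acyclic graph it returns None.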
According to the 2022 Verizon Data Breach Investigations Report (DBIR) [5],
supply chain attacks increased in the previous year, accounting for 9% of the
report's total incident corpus.
2.1.3 Open source and supply chain
Using open source software artifacts has the benefit of relying on products
that are surrounded by strong communities whose aim is to improve the software.
In addition, using an open source software artifact is free of charge and its
code is freely available to anyone.
However, the downside is that the constant availability of the source code
is a great advantage for an attacker and, as highlighted by [6], even new
vulnerability fixes can help the attacker find similar vulnerabilities in other
software. Another problem is that adding an artifact to a supply chain means
completely trusting its author, and an open source artifact can be managed
either by a large, important organization or by a single person (e.g. the
left-pad package).
In the 2021 State of the Software Supply Chain Report [7], Sonatype reports a
650% increase in detected supply chain attacks aimed at exploiting weaknesses
in upstream open source ecosystems. The year before, Sonatype registered a
430% increase in these supply chain attacks.
The threat model of these attacks considers an attacker that injects malware
directly into open source projects to infiltrate the supply chain.
As per the Sonatype report, these software supply chain attacks are insidious
because attackers do not wait for public vulnerability disclosures to exploit:
instead, they inject new vulnerabilities into open source projects which are
part of the supply chain.
In this work, only dependencies such as libraries and open source packages
used within software artifacts are considered as elements of the supply
chain. In detail, the ecosystems considered mainly concern open source
dependencies of specific languages and specific package managers, such as
npm for JavaScript or Maven for Java.
However, it should be emphasized that the components of the software supply
chain are not limited to those taken into consideration in this work. For
example, a software artifact published in a production environment with the
help of a package manager will have that package manager in its supply chain,
as well as all the other software used in the publication process.
Regarding the software artifacts taken into account, only open source projects
of public administrations are considered in this work.
2.2 SBoM
2.2.1 Software Bill Of Materials
As noted in the previous sections, dependency management and supply chain
attacks are critical issues in managing a software artifact.
On May 12, 2021, the President of the United States signed an Executive
Order [1] with standards and best practices concerning the Nation’s Cyberse-
curity. In the “Enhancing Software Supply Chain Security” section, the use
of SBoMs is recommended to keep track of the supply chain relationships of
software applications.
The SBoM concept was born as a collaborative community project driven by the
National Telecommunications and Information Administration [8] in 2018.
The Executive Order gave a brief description of what a SBoM is:
The term “Software Bill of Materials” or “SBoM” means a for-
mal record containing the details and supply chain relationships of
various components used in building software.
A machine-readable SBoM format allows a software vendor to track each
dependency (libraries, other software or components) of its software artifact
and to make sure they are up to date in a programmatic way. As per the
Executive Order, buyers can use a SBoM to perform vulnerability analyses to
evaluate the risk of a product, and vendors can determine whether their
software is at potential risk from a newly discovered vulnerability. A
machine-readable SBoM can be used with automation and integration tools to
exploit its potential for understanding the supply chain of a software artifact.
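As a rough illustration of what "machine readable" enables, the Python sketch below reads a heavily trimmed SPDX 2.2 document in JSON form and lists its components. The package names, versions and the advisory identifier are invented for the example, and the exact-version lookup is a simplifying assumption: real analysis tools typically match version ranges and package identifiers rather than exact pairs.

```python
import json

# Hypothetical, heavily trimmed SPDX 2.2 JSON document.
# Real SBoM files carry many more fields for each package.
SPDX_TEXT = """
{
  "spdxVersion": "SPDX-2.2",
  "name": "example-project",
  "packages": [
    {"SPDXID": "SPDXRef-Package-1", "name": "example-lib", "versionInfo": "1.0.0"},
    {"SPDXID": "SPDXRef-Package-2", "name": "another-lib", "versionInfo": "2.1.0"}
  ]
}
"""

# Invented advisory data: the identifier and the affected version are not real.
ADVISORIES = {("example-lib", "1.0.0"): ["EXAMPLE-0001"]}


def list_components(spdx_text):
    """Return (name, version) pairs for the packages listed in the document."""
    document = json.loads(spdx_text)
    return [
        (package["name"], package.get("versionInfo", ""))
        for package in document.get("packages", [])
    ]


def match_advisories(components, advisories):
    """Keep only the components that appear in the advisory mapping."""
    return {c: advisories[c] for c in components if c in advisories}
```

Calling match_advisories(list_components(SPDX_TEXT), ADVISORIES) returns only the entry for example-lib, which is the kind of programmatic check that a vendor or buyer could repeat whenever new advisories are published.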

2.2.2 Standards
Several SBoM standards have been developed to provide a unified approach
for generating SBoMs and sharing them. A SBoM standard provides a schema
that describes the structure and dependencies of a software artifact in a way
that is consumable by other tools for monitoring and managing the supply chain
(e.g. software for vulnerability analysis).
Among the various standards, the most common are:
• CycloneDX [9], a lightweight SBoM standard designed for use in application
security contexts and supply chain component analysis. It allows SBoMs to
be created in XML or JSON formats; it is an open source project maintained
by the OWASP Foundation.
• SPDX (Software Package Data Exchange) [10], an open standard hosted
by the Linux Foundation for communicating software bill of materials
information. It allows SBoMs to be created in YAML, JSON, XLS and RDF
formats.
2.2.3 Tools
There are several tools for creating SBoMs. These tools can take as input
a Docker image, other SBoMs or a source folder. They automatically search for
components used within a project, such as libraries and packages used within
a certain ecosystem, by parsing the files that contain the list of dependencies
for that ecosystem.
Two examples of software for creating SBoMs are Syft [11] from Anchore and
sbom-tool [12] from Microsoft. They are both publicly available on GitHub.
While the former can generate SBoMs in CycloneDX, SPDX and Syft’s own format,
the latter can generate SBoMs only in SPDX 2.2 format. The CycloneDX project
has a collection of tools in the Tool Center section of its website [9] that
can be used to build SBoMs in CycloneDX format.
In this work sbom-tool by Microsoft is used. Sbom-tool is capable of building
SBoMs in SPDX 2.2 format, taking as input the source folder of a project.
2.3 Review of the Literature
2.3.1 Top Five Challenges in Software Supply Chain Security:
Observations From 30 Industry and Government Organizations
In the work in [4] the authors conducted three software supply chain
security summits (two industry summits and one government summit). A total of
30 organizations from different sectors, all from the United States, attended
these 3 summits. The paper presents the five most important challenges in
supply chain security that were identified during these summits.
Challenge 1 - Updating of vulnerable dependencies. It was noted that quickly
updating a vulnerable dependency to its latest version can introduce
malicious code. The participants of the summits offered advice
such as: never be the first or last to update a dependency; adopt continuous
integration/continuous deployment (CI/CD) policies to prevent the inclusion
of vulnerable dependencies; maintain a “zero trust” policy for dependencies.
Challenge 2 - Leveraging the SBoMs for Security. The US President’s Executive
Order brought the old concept of SBoMs to the forefront. Some participants
found the sharing of SBoMs harmful: the use of a dependency is not atomic,
and developers often pull in only specific parts. So why not simply
request accurate and timely vulnerability information? Other participants felt
that SBoMs enable a zero-trust approach to the supply chain and have the
potential to lay the foundations for innovative security improvements.
Challenge 3 - Choosing Trusted Supply Chain. A crucial point is whether to
trust the maintainers of a library, an organization, or the integrity of the build
environment. Package maintainers are looking for ways to automatically iden-
tify malicious packages, for instance by using only the metadata of a package.
However, all techniques used so far present technical challenges.
Challenge 4 - Securing the Build Process. The recent use of CI/CD tools
opens up the possibility for new attacks that inject malicious code during the
build process. Participants were largely positive about the use of Supply Chain
Layers for Software Artifacts (SLSA, a framework that provides a checklist of
standards to be met during the build process). However, there is still a lack
of knowledge about what the risks are.
Challenge 5 - Getting Industry-Wide Participation. The big tech companies
have been working for a long time to solve supply chain security problems,
with great (and manual) efforts that only help their own company. However, some
of the major players are already coming together. Some noteworthy guidelines
and methodologies are the Building Security in Maturity Model (BSIMM) and
the Open Web Application Security Project (OWASP).
2.3.2 An Empirical Comparison of Dependency Network Evolution
in Seven Software Packaging Ecosystems
The work in [13] analyses the evolution over time of package dependency
networks for seven packaging ecosystems (defined by seven package managers):
cargo (Rust), cpan (Perl), cran (R), npm (JavaScript), nuget (.NET), packagist
(PHP) and rubygems (Ruby), using the libraries.io dataset (LINK). The temporal
data covers the period from the birth of each ecosystem until the end
of 2016. The authors analysed the seven ecosystems in order to answer 4
research questions:
• How do package dependency networks grow over time?
• How frequently are packages updated?
• To which extent do packages depend on other packages?
• How prevalent are transitive dependencies?
The results obtained are summarized below:
• All the ecosystems taken into account seem to grow over time (in terms
of number of packages). The ones that seem to grow fastest are cran and
npm, which do so exponentially. The ratio of dependencies over packages
remains stable for cpan, packagist and rubygems, while it increases for
all the others.
• In all ecosystems the number of package updates is either stable or
growing over time. The majority of updates:
– Come only from a small set of active packages.
– Involve packages that are no older than 12 months.
• A majority of dependent packages depend on a minority of required
packages. Among the latter, only a small subset accounts for a high
proportion of reverse dependencies.
• In all ecosystems, more than half of the top level packages (packages
that are not dependencies) have a dependency tree of depth greater than
or equal to three.
2.3.3 Structure and Evolution of Package Dependency Networks
Similarly to the previous work, the work in [14] focuses on analysing the
structure and evolution of networks of dependencies for different ecosystems.
The ecosystems considered in that work are related to JavaScript, Ruby and
Rust. Package data were obtained from the central repositories for JavaScript
and Ruby (npm and RubyGems, respectively) and from GitHub for Rust. The
analyses also considered end user applications from GitHub for all three
ecosystems.
The authors organized the analyses around the following 3 research
questions:
• What are the static characteristics of package dependency networks?
• How do package dependency networks evolve?
• How resilient are package dependency networks to a removal of a random
project?
The results obtained are summarized below:
• The number of transitive dependents for JavaScript is almost two times
the number of transitive dependents for other languages. In addition,
the dependency networks (represented as directed graphs) of all the
ecosystems present a giant weakly connected component (composed of 96.14%,
98.2% and 100% of projects for Rust, JavaScript and Ruby, respectively).
• The total amount of dependencies for each project release grows faster
for JavaScript projects than for the others, with an average size of total
dependencies that registered more than 60% yearly growth between
2015 and 2016. Looking at the variation of the numbers of direct and
transitive dependencies between 2005 and 2017, it can be noted that
JavaScript projects have more transitive dependencies than the others,
but fewer direct dependencies. The authors suppose that the high number
of transitive JavaScript dependencies is due to the possibility of having
multiple versions of the same package in a given project.
• Every ecosystem studied has at least one package whose removal could
impact 30% of the other packages (and applications). Among the most
depended-upon packages, those of JavaScript are utility packages, those
of Ruby are packages related to web servers and those of Rust are
interfaces to system level types and libraries.
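The notion of a package whose removal impacts a share of the other packages can be made concrete by following dependency edges backwards. The Python sketch below is only an illustration under simple assumptions (a single version per package, every dependent counted as impacted) and is not the method of [14]; the function name and the package names used in the usage note are hypothetical.

```python
def impacted_fraction(graph, removed):
    """Fraction of the other packages that depend, directly or transitively,
    on `removed`. `graph` maps each package to the packages it depends on."""
    # Build the reverse graph: dependency -> packages that depend on it.
    dependents = {}
    for package, deps in graph.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(package)

    # Walk the reverse edges starting from the removed package.
    impacted = set()
    stack = [removed]
    while stack:
        current = stack.pop()
        for dependent in dependents.get(current, ()):
            if dependent not in impacted:
                impacted.add(dependent)
                stack.append(dependent)
    impacted.discard(removed)

    # All packages other than the removed one, whether they appear as keys
    # or only as dependencies.
    others = (set(graph) | set(dependents)) - {removed}
    return len(impacted) / len(others) if others else 0.0
```

In a toy graph where app depends on lib, lib depends on util and tool has no dependencies, removing util impacts lib and app, i.e. two of the three other packages.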

3 GitHub data collection
3.1 GitHub and Government
In this work, open source organizations of the public administrations of
different countries are considered. To this end, the GitHub page “GitHub and
Government” [15] is used. The GitHub and Government page contains a list
of the GitHub organizations registered as government agencies at the national,
state and local level for different countries. This GitHub list is subdivided
into three sections:
• Governments (includes 887 organizations)
• Civic Hackers (includes 305 organizations)
• Government-funded Research (includes 123 organizations)
The sections “Governments” and “Government-funded Research” are subdi-
vided in subsections, each associated with a specific country (except for “Eu-
ropean Union” and “United Nations” subsections, that have respectively 1
and 10 organizations).
The section “Civic Hackers” is subdivided in a few subsections with no clear
association with any country:
• Civic Hackers (includes 159 organizations)
• Code for All (includes 18 organizations)
• Code for America (includes 108 organizations)
• Open Knowledge Foundation (includes 20 organizations)
The list of organizations was obtained on October 11th, 2022. After obtaining
the list, a country is assigned to each organization org, as follows:
• If org is listed in a subsection associated with a country c, org is
associated with c.
• If the metadata of org (obtained with the repos API of GitHub [18]) include
geographical information, org is associated with the country c resulting
from those metadata.
• If org is listed in the “Civic Hackers” section and “Code for America”
subsection, org is associated with country c = “United States”.
• Otherwise, org is not associated with any country.
Based on this procedure, 73 countries with at least one organization are found,
together with 102 organizations not associated with any country (these include
the 10 organizations listed as “United Nations” and the 1 listed as “European
Union”).
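The assignment procedure can be sketched in Python as follows, assuming that the rules are applied in the order in which they are listed. The set of country subsections is an illustrative subset, and country_from_location is a hypothetical placeholder: how geographical metadata are actually mapped to a country is not detailed here.

```python
# Illustrative subset of the subsections that are named after a country.
COUNTRY_SUBSECTIONS = {"Italy", "Germany", "United Kingdom", "United States"}


def country_from_location(location):
    """Hypothetical helper: map a free-text location to a country name.

    The lookup table is invented; a real mapping would need geocoding
    or a much larger table.
    """
    known_locations = {"Rome, Italy": "Italy", "Berlin": "Germany"}
    return known_locations.get(location)


def assign_country(section, subsection, location=None):
    """Return the country assigned to an organization, or None."""
    # Rule 1: the organization is listed under a country subsection.
    if subsection in COUNTRY_SUBSECTIONS:
        return subsection
    # Rule 2: the organization metadata carry geographical information.
    country = country_from_location(location) if location else None
    if country:
        return country
    # Rule 3: "Code for America" organizations are assigned to the United States.
    if section == "Civic Hackers" and subsection == "Code for America":
        return "United States"
    # Rule 4: no country can be assigned.
    return None
```

For example, an organization listed under “Civic Hackers” and “Code for All” whose metadata report “Berlin” would be assigned to Germany by the second rule, while an organization listed under “United Nations” with no usable location would remain without a country.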

In addition, 11 organizations were removed from the list. This is because,
although appearing in the GitHub page, they are no longer available on the
platform.
3.1.1 Organizations characteristics per country
Figure 1 shows the number of GitHub organizations of the GitHub list for the
top 16 countries per number of organizations. It can be seen that the United
States has over 500 organizations, followed by the United Kingdom with just
over 100 organizations. The number of organizations decreases rapidly down
to Italy, in 16th place, with 9 organizations.
Figure 1: Number of GitHub organizations per country for the top 16 countries per
number of organizations.
Figure 2 shows the fraction of the number of organizations that existed in a
certain year for each country. The countries taken into consideration are the
16 countries of Figure 1.
It can be seen that trends are similar for all countries, with exponential
growth in the period 2012 - 2018. More than half of the countries reached 80%
of their total number of organizations by the beginning of 2017.
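The quantity plotted in Figure 2 can be computed from the creation dates of the organizations of a country. The Python sketch below assumes that an organization "exists" in a given year if it was created in that year or earlier; the dates in the usage note are invented and are not taken from the datasets.

```python
from datetime import date


def fraction_existing(creation_dates, years):
    """For each year, the fraction of organizations created in that year
    or earlier, relative to the total number of organizations."""
    total = len(creation_dates)
    if total == 0:
        return {year: 0.0 for year in years}
    return {
        year: sum(created.year <= year for created in creation_dates) / total
        for year in years
    }
```

With four invented creation dates in 2012, 2014, 2014 and 2018, the fractions for the years 2012, 2014 and 2018 are 0.25, 0.75 and 1.0 respectively.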

Figure 2: Fraction of the number of existing GitHub organizations per country over
the years.
Figure 3 shows the average number of members and followers of the GitHub
organizations of the 16 countries with the largest number of organizations.
Interestingly, the highest average number of followers, more than 60, is
obtained by Italian organizations. This is due to the organization Developers
Italia, which registered 510 followers on 11 October 2022. Apart from
Italy, the countries with the most followers on average are the United States,
the United Kingdom and France, each with an average of more than 10 followers.
For all countries, the average number of members of a GitHub organization is
less than 10.

Figure 3: Average number of members and followers for the top 16 countries per
number of organizations.
Figure 4: Number of existing GitHub repositories per organization for the top 16
organizations per number of repositories.
The organizations on the GitHub list with the largest number of repositories
are now considered individually. Figure 4 presents the cumulative curves of the
number of repositories per creation date per organization. The organizations
considered are the 16 organizations with the largest number of repositories in
the entire GitHub list.
It can be seen that all the organizations in Figure 4, except for navikt and
bcgov, are from the United States or the United Kingdom. The organization
with the most repositories is navikt, from Norway, which has almost 1750
repositories. A total of 9 organizations have more than 1000 repositories.
3.1.2 Repositories characteristics per country
Figure 5: Total number of repositories of all organizations of the top 16 countries
per number of organizations.
Figure 5 shows the total number of repositories among all organizations in
each country. It can be seen that the United States and the United Kingdom,
the countries with the highest number of organizations, also have the highest
number of repositories. In particular, the United States has a total of over 25K
repositories, almost twice the number of repositories of the United Kingdom.
Figure 6 shows the average number of repositories per country. The aver-
age is computed across the organizations of each country.
The highest value is recorded by the United Kingdom, with an average of more
than 140 repositories per organization. Right after United Kingdom, there are
Norway, Finland and Italy, with an average of more than 80 repositories per
organization. It can be noted from Figure 3 that for these 4 countries, the
average number of members per organization is consistently less than 10. The
most extreme case concerns the United Kingdom organizations: on average,
they have more than 140 repositories and fewer than 3 members per organization
managing them.

Figure 6: Average number of repositories of all organizations of the top 16
countries per number of organizations
Given a repository, GitHub provides the amount of code (in terms of kB) in
the repository for each programming language. It then assigns a language to
each repository as the most frequent language in the repository.
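The assignment rule described above can be sketched in a few lines. This is only an illustration: the function name and the byte counts are hypothetical, and the input is assumed to be a mapping from language name to amount of code.

```python
from typing import Dict, Optional


def primary_language(code_per_language: Dict[str, int]) -> Optional[str]:
    """Return the language with the largest amount of code.

    Returns None when no language is detected (e.g. an empty repository
    or a repository containing only documentation or datasets).
    """
    if not code_per_language:
        return None
    return max(code_per_language, key=code_per_language.get)


# Hypothetical amounts of code for one repository (not real data).
example = {"Python": 120_000, "JavaScript": 45_000, "HTML": 8_000}
print(primary_language(example))
print(primary_language({}))
```

Repositories for which the function returns None correspond to the cases, discussed in this section, in which no language is specified.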
Figure 7 presents, for the repositories of all organizations in each of the 16
countries already taken into account, the most frequent language and the
fraction of repositories (relative to the total number of repositories)
labeled with it.
It can be seen that Python is the most frequent language in 6 countries, fol-
lowed by JavaScript which is the most frequent in 5 countries. It can also be
seen that for the repositories of Norway, Kotlin is the most frequent language,
while for those of Sweden and Spain, PHP is the most frequent.
On the other hand, for most of the repositories of the United States, GitHub
does not specify any language; this is the case of repositories that do not
contain code but contain other things, such as documentation or datasets.
Figure 7: Fraction of the total number of repositories labeled with the most popular
language for the top 16 countries per number of organizations.
Among the cases in which no language is assigned to a repository, there is also
the case in which the repository is empty. Figure 8 shows the fraction of empty
repositories (relative to the total number of repositories) of all
organizations in each of the 16 countries. It can be seen that for 7 countries,
empty repositories account for more than 2% of the total. The highest
value is recorded by the repositories of Mexico, where more than 5% of them
are empty.
Figure 8: Fraction of empty repositories with respect to the total number of
repositories for the top 16 countries per number of organizations.
3.2 The case study: 4 datasets for Italy, Germany, the United
States and the United Kingdom
This work considers a limited number of countries: Italy,
Germany, the United States and the United Kingdom. Since the repositories of
the United States and the United Kingdom amount to more
than 28K and 15K respectively, these two sets are not studied
in their entirety.
In particular, the following GitHub repositories are analysed:
• The repositories of all the 9 organizations with country = “Italy” (713
repositories).
• The repositories of all the 30 organizations with country = “Germany”
(1 308 repositories).
• The repositories of the organization US General Services Administration,
country = “United States” (937 repositories).
• The repositories of the organization Government Digital Service, country
= “United Kingdom” (1 563 repositories).
From now on, these four datasets will be named using the abbreviations IT,
DE, US and UK.

(a) IT (b) DE
(c) UK (d) US
Figure 9: Fraction of GitHub repositories with respect to the total number of
repositories per language for the 4 datasets. The languages taken into account are
those for which, among all datasets, there are at least 30 repositories marked with
that language.
The distributions of the programming languages associated with each reposi-
tory for each dataset are summarised in Figure 9.
The figure shows what emerged in the previous section: the most used lan-
guages in the repositories of Italian and German organizations are Python
and JavaScript respectively. In the US dataset, most of the repositories (more
than 20%) do not contain any code, while in the UK dataset the most used
language is Ruby.

      Size (MB)                 Stars
      Avg     Med    Max        Avg     Med   Max
IT    47.52   0.72   4432       12.86   1     3929
DE    19.22   0.59   1818       11.44   0     3352
US    24.05   0.30   1511        9.46   1     1899
UK    12.59   0.35   8299        7.16   1      768

      Watchers                  Forks                   Open Issues
      Avg     Med    Max        Avg     Med   Max       Avg     Med   Max
IT     9.43    8     208        7.68    1     2297      5.75    1     383
DE     5.37    4     208        3.11    0      601      4.00    0     403
US    10.83    9     287        6.17    2      457      5.23    0     741
UK    24.10   18     131        4.99    2      275      2.77    0     280
Table 1: Average, median and maximum of repositories size, number of stars,
number of watchers, number of forks and number of open issues for the 4 datasets.
Table 1 shows the average, median and maximum values of the repository
sizes and of the number of stars, watchers, forks and open issues for all
repositories of the 4 datasets.
It can be seen that the repositories of the IT dataset have the largest average
size and the highest average number of stars.
A watcher, for a GitHub repository, is a user that follows the repository to
stay up-to-date on it. The UK dataset has on average the highest number
of watchers, while on average the IT dataset has the highest number of open
issues and forks. It can also be noted that the median values of the number
of stars, forks and open issues are close to 0 and far below the average values
for all datasets. This indicates that each of these distributions is strongly
right-skewed: most repositories have values close to 0, while a few
repositories with very large values raise the average.
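The gap between median and average reported in Table 1 can be reproduced with a small sketch. The star counts below are hypothetical and chosen only to show that a single very popular repository is enough to pull the average far above a median close to 0.

```python
from statistics import mean, median


def summarise(values):
    """Return the three statistics reported in Table 1 for one metric."""
    return {"avg": mean(values), "med": median(values), "max": max(values)}


# Hypothetical star counts: many repositories with few stars, one outlier.
stars = [0, 0, 0, 1, 1, 1, 2, 3, 5, 500]
summary = summarise(stars)
print(summary)
```

In this toy sample the median stays at 1 while the average exceeds 50, the same kind of divergence visible in the Stars, Forks and Open Issues columns.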
Regarding the number of repositories, Figure 10 displays the creation date
and the date of last update for each repository of the 4 datasets. In detail, each
segment of a graph represents a repository: it starts at the creation date
and ends at the last update date of the repository. Segments are ordered from
bottom to top by creation date.
For each segment, the left extreme point represents the creation date of the
repository relative to that segment. Now take the set of left extreme points
of all the segments. The resulting curve formed by these points represents the
cumulative number of repositories in the dataset over the years.
These curves show that the number of repositories steadily increases in all the
datasets; this indicates how the organizations of the 4 datasets appear to be
active from 2013 to the present day.
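The construction of the cumulative curve from the left extreme points of the segments can be sketched as follows. The creation dates are hypothetical, and the function name is an assumption introduced for illustration.

```python
from datetime import date


def cumulative_counts(creation_dates):
    """Return (creation date, cumulative number of repositories) points.

    Sorting by creation date mirrors the bottom-to-top ordering of the
    segments: the i-th segment from the bottom starts at the i-th point.
    """
    ordered = sorted(creation_dates)
    return [(day, count) for count, day in enumerate(ordered, start=1)]


# Hypothetical creation dates of three repositories.
created = [date(2015, 3, 1), date(2013, 6, 10), date(2019, 1, 20)]
curve = cumulative_counts(created)
for day, count in curve:
    print(day.isoformat(), count)
```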

(a) IT (b) DE
(c) UK (d) US
Figure 10: For each dataset, a set of stacked segments where each segment represents a
repository, starting at the creation date and ending at the last update date of the
repository. Segments are ordered from bottom to top by creation date.

Although the number of repositories of the organizations of each dataset seems to
be growing steadily, not all repositories are constantly updated. In particular,
the following shares of repositories have been updated in the last year (since
11 October 2021):
• 50.35% of the repositories of the IT dataset.
• 86.54% of the repositories of the DE dataset.
• 83.03% of the repositories of the UK dataset.
• 42.50% of the repositories of the US dataset.
Moreover, it can be observed from Figure 10 that:
• Many of the IT repositories created between 2013 and 2015 became inactive
almost immediately.
• A large portion of US repositories became inactive in early 2018 (an
almost vertical line is visible just after 2018).

4 Methods and Analyses
4.1 SBoM creation and vulnerability detection
After obtaining the metadata for the organizations and repositories of the IT,
DE, UK and US datasets, the following actions have been executed for the 4
datasets.
Step 1 - First, each repository of the 4 datasets is downloaded and the Microsoft
sbom-tool [12] is executed on all the repositories. This tool takes the
set of source files that compose a software artifact (e.g., a GitHub repository,
a Python package and the like) as input and provides the corresponding
software bill of materials (SBoM) in the SPDX 2.2 format as output. The
constructed SBoM contains the list of dependencies of the repository. This list
is constructed by the component-detection package scanning tool, also available
on GitHub [16]. The programming language ecosystems supported are:
CocoaPods, Conda, Gradle, Go, Maven, npm, Yarn, NuGet, PyPi, Poetry,
Ruby, Cargo. Full details can be found in the tool documentation [17].
Content and structure of the list of dependencies constructed by [16] depend
on the software ecosystems used by the artifact under analysis. Specifically:
• The set of dependencies is extracted from manifest files, whose names
are expected to follow a predefined pattern specific to each supported
ecosystem.
• For most supported ecosystems, transitive dependencies (i.e., dependen-
cies of dependencies) are also extracted.
Dependencies take the form of a tuple (package manager, package name, pack-
age version). In a given repository, there could be multiple dependencies from
a given (package manager, package name) pair. If those multiple dependencies
result from manifest files in different folders, then sbom-tool includes one tuple
for each different package-version found. Otherwise, i.e., when the multiple
dependencies result from the same manifest file, sbom-tool may include one
or more tuples (each with a different package-version) according to a certain
set of ecosystem-specific rules.
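The tuple representation can be illustrated with a minimal sketch. It assumes that each dependency is available as a package URL (purl) string, which is an assumption of this example and not a description of the internals of sbom-tool; the purls below are hypothetical and the parser handles only the basic purl shape.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote


class Dependency(NamedTuple):
    manager: str
    name: str
    version: Optional[str]


def parse_purl(purl: str) -> Dependency:
    """Split a purl such as pkg:pypi/numpy@1.20.3 into a tuple.

    Simplified on purpose: qualifiers (?...) and subpaths (#...) are dropped.
    """
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a purl: {purl!r}")
    body = purl[4:].split("#", 1)[0].split("?", 1)[0]
    manager, _, rest = body.partition("/")
    name, separator, version = rest.rpartition("@")
    if not separator:  # no version present in the purl
        name, version = rest, None
    return Dependency(manager.lower(), unquote(name),
                      unquote(version) if version else None)


# Hypothetical purls collected from manifest files in different folders.
purls = [
    "pkg:pypi/numpy@1.20.3",
    "pkg:pypi/numpy@1.21.0",
    "pkg:pypi/numpy@1.20.3",  # same tuple found twice
    "pkg:npm/%40angular/core@12.0.0",
]
dependencies = sorted({parse_purl(p) for p in purls})
for dependency in dependencies:
    print(dependency)
```

Using a set keeps one tuple per distinct package version, so numpy appears twice (two versions) rather than three times.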
Step 2 - Second, an attempt is made to find in the analysed repositories further
dependencies possibly missing from those obtained by sbom-tool at the
previous step. This additional step is executed because preliminary analyses
showed that manifest files might not be fully accurate in listing the dependencies
that can indeed be found statically. Specifically, a
further list of dependencies is constructed for each repository and kept
separate from the one constructed at the previous step, by executing the
following actions:
1. Execute the check-imports [18] package on each repository associated
with the JavaScript language. This package is publicly available on
GitHub and constructs dependencies for JavaScript and TypeScript source
files based on static analysis.
2. Execute the pipreqsnb tool [19] on each repository associated with the
Python language. This tool is publicly available on GitHub and constructs
dependencies for Python and Python notebook source files based
on static analysis.
3. Remove from the dependencies obtained at steps 1 and 2 those already
found by sbom-tool (independently from their versions).
4. For each Python repository, add the dependencies found at step 2 to
the requirements.txt files, execute the sbom-tool again and retain the
dependencies from (package-manager, package-name) pairs that had not
been found by sbom-tool. This makes it possible to collect a further
degree of transitive dependencies.
The analyses 1-4 are executed by skipping all files and folders whose names
contain one of the following keywords: development, optional, enhances, sug-
gests, build, configure, test, develop, dev, example, doc (in order to not take
into account any dependencies that are either necessary only for development
or that are optional).
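The exclusion rule can be sketched as a simple path filter. The keyword list is the one given above; the case-insensitive substring matching on each path component is an assumption of this sketch, since the exact matching rule is not specified.

```python
from pathlib import PurePosixPath

SKIP_KEYWORDS = (
    "development", "optional", "enhances", "suggests", "build",
    "configure", "test", "develop", "dev", "example", "doc",
)


def is_skipped(path: str) -> bool:
    """Return True if any file or folder name in the path contains a keyword."""
    parts = PurePosixPath(path.lower()).parts
    return any(keyword in part for part in parts for keyword in SKIP_KEYWORDS)


# Hypothetical repository paths.
for candidate in ("src/app/main.py", "tests/test_api.py", "docs/conf.py"):
    print(candidate, is_skipped(candidate))
```

Under this assumption, substring matching is deliberately broad: a name such as Dockerfile would also be skipped, because it contains the keyword doc.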
In summary, two lists of dependencies are constructed for each repository:
• manifest dependency: those obtained by the sbom-tool at Step 1.
These dependencies may involve packages of any ecosystem supported
by sbom-tool.
• parsed dependency: those obtained by check-imports, pipreqsnb and a
further sbom-tool execution at Step 2. These dependencies may involve
only JavaScript, TypeScript and Python packages.
The content of the two lists is disjoint by construction. It is chosen to not
distinguish between direct and transitive dependencies.
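The disjointness of the two lists follows from the removal performed at step 3, which ignores versions. A minimal sketch, with hypothetical tuples and a hypothetical function name:

```python
def remove_known(parsed, manifest):
    """Drop parsed dependencies whose (manager, name) pair is already
    present among the manifest dependencies, whatever the version."""
    known = {(manager, name) for manager, name, _version in manifest}
    return {dep for dep in parsed if (dep[0], dep[1]) not in known}


# Hypothetical (package manager, package name, version) tuples.
manifest = {
    ("pypi", "requests", "2.28.1"),
    ("npm", "lodash", "4.17.21"),
}
parsed = {
    ("pypi", "requests", None),      # already known, version ignored
    ("npm", "lodash", "4.17.20"),    # already known, different version
    ("pypi", "pandas", None),        # new dependency
}
parsed_only = remove_known(parsed, manifest)
print(parsed_only)
```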
Then, each repository is analysed with the Grype tool available on GitHub [20].
This tool scans a set of source files in search of dependencies and lists the
known vulnerabilities of the dependencies found. The input source files may
actually be specified also as a SBoM, in which case the packages are obtained
from the SBoM directly. In detail, the following steps are executed:
• For each SBoM s found in step 1, analyse s with Grype.
• For each dependency p found in step 2, search the vulnerabilities of p in
the Grype database.
Moreover, the first step was executed twice with different parameters, resulting
in two different sets of vulnerabilities:
• grype: set of vulnerabilities obtained from the execution of Grype.
• grype cpe: set of vulnerabilities obtained from the execution of Grype
using the add-cpes-if-none parameter. This parameter attempts to
generate CPE information (CPE is a structured naming scheme for in-
formation technology systems [21]) if not present in the SBoM. In partic-
ular, this implies that if the CPE identifier for a dependency is present in
the SBoM, then running Grype with or without the add-cpes-if-none
parameter would lead to the same set of vulnerabilities.
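The two executions can be sketched as two command lines that differ only in the add-cpes-if-none flag. The sketch assumes that the grype executable is installed and that the SBoM is passed with the sbom: prefix and JSON output, as in the Grype documentation to the best of our understanding; the file name is hypothetical and the options should be checked against the installed version.

```python
import subprocess
from typing import List


def grype_command(sbom_path: str, add_cpes: bool = False) -> List[str]:
    """Build the Grype command line for one SBoM file."""
    command = ["grype", f"sbom:{sbom_path}", "-o", "json"]
    if add_cpes:
        command.append("--add-cpes-if-none")
    return command


def run_grype(sbom_path: str, add_cpes: bool = False) -> str:
    """Run Grype and return its JSON report (requires grype on the PATH)."""
    result = subprocess.run(grype_command(sbom_path, add_cpes),
                            capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical SBoM file name; only the command lines are printed here.
print(grype_command("manifest.spdx.json"))
print(grype_command("manifest.spdx.json", add_cpes=True))
```

The first command corresponds to the grype set and the second to the grype cpe set.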
In addition, for all vulnerabilities collected with Grype, an attempt was made
to obtain other metadata not provided by Grype, using one of the following:
• CVE API provided by the NIST National Vulnerability Database [22].
• Scraping of GitHub Advisory Database [23] website pages.
Finally, a different set of vulnerabilities called osv api is obtained using the
API of OSV [24], a distributed vulnerability database for Open Source. More
specifically, the API is queried for each tuple (package manager, package name,
version) and the vulnerabilities found are stored separately from those ob-
tained with Grype.
Details on the vulnerability data sources used by Grype and OSV API can be
found in the respective documentation.
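A query to the OSV API for one tuple can be sketched as follows. The endpoint and the request body follow the public OSV documentation to the best of our understanding; the mapping from package manager names to OSV ecosystem names is an assumption of this sketch and is deliberately partial.

```python
import json
from urllib import request

OSV_QUERY_URL = "https://api.osv.dev/v1/query"

# Assumed mapping from purl-style manager names to OSV ecosystem names.
OSV_ECOSYSTEMS = {
    "pypi": "PyPI",
    "npm": "npm",
    "maven": "Maven",
    "gem": "RubyGems",
    "cargo": "crates.io",
    "nuget": "NuGet",
    "golang": "Go",
}


def osv_payload(manager: str, name: str, version: str) -> dict:
    """Build the JSON body of a query for one package version."""
    return {
        "version": version,
        "package": {"name": name, "ecosystem": OSV_ECOSYSTEMS[manager]},
    }


def query_osv(manager: str, name: str, version: str) -> list:
    """Return the ids of the vulnerabilities reported by OSV (network call)."""
    data = json.dumps(osv_payload(manager, name, version)).encode("utf-8")
    req = request.Request(OSV_QUERY_URL, data=data,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req, timeout=30) as response:
        body = json.load(response)
    return [vuln["id"] for vuln in body.get("vulns", [])]


print(json.dumps(osv_payload("pypi", "numpy", "1.20.3")))
```

Note that for Maven, OSV expects package names in the group:artifact form, so a conversion from the group/artifact form used elsewhere in this work would be needed.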

4.2 Database
Figure 11: Relational database schema realised for data storage.
All the data collected concerning GitHub organizations, dependencies and vul-
nerabilities are stored in a relational database. The schema of this database
can be seen in Figure 11.
The relational database tables are summarised below:
• organization. The table defines a GitHub organization. It contains the
collected properties for that organization. The primary key url is the
GitHub profile url.
• repository. The table defines a GitHub repository. It contains the
collected properties for that repository. The primary key url is the url
of the GitHub repository. The field organization is a foreign key that
refers to the table organization.
• user. The table defines a GitHub user. It contains the username and
creation date of the user’s profile. The primary key user name is the
username of the GitHub user.
• contributor. The table defines the contribution relationship between
a GitHub user and a repository. It contains the properties of this re-
lationship, such as the number of contributions, the number of rejected
pull requests and the maximum number of commits made on a certain
day. The primary key is the (user name, repository) pair, where the
user name field is a foreign key that refers to the user table and the
repository field is a foreign key that refers to the repository table.
• package. The table defines the package entity as the tuple (package
manager, package name, version). The primary key purl is the purl
(package url) [25] of the package.
• manifest dependency. The table defines the dependency relationship
between a repository and a package. This dependency relation belongs
to the dependency list manifest dependency (defined in subsection 4.1).
The primary key is the (repository, package) pair, where the repository
field is a foreign key that refers to the repository table and the package
field is a foreign key that refers to the package table.
• parsed dependency. The table defines the dependency relationship
between a repository and a package. This dependency relation belongs
to the dependency list parsed dependency (defined in subsection 4.1).
The primary key is the (repository, package) pair, where the repository
field is a foreign key that refers to the repository table and the package
field is a foreign key that refers to the package table.
• vulnerability. The table defines a vulnerability. It contains the primary
properties collected for that vulnerability. The primary key id is
the id of the vulnerability. The vulnerability id does not have a fixed
structure but depends on the origin of the vulnerability. In detail, ids
could start with:
– CVE: vulnerability comes from one of the publicly available vulnerability
data sources used by Grype and has been assigned a
CVE ID number.
– GHSA: vulnerability comes from GitHub Advisory Database [23].
In particular vulnerabilities found that come from GitHub are GitHub-
reviewed advisories. These advisories are vulnerabilities that have
been mapped to packages in ecosystems that GitHub supports (fur-
ther information can be found in GHSA documentation [26]).
– GO: vulnerability comes from Go Vulnerability Database [27].
– OSV: vulnerability comes from OSS-Fuzz Database [28].
– PYSEC: vulnerability comes from PyPI Advisory Database [29].
– RUSTSEC: vulnerability comes from Rust Advisory Database [30].
• vulnerability metadata. The table defines metadata of a vulnerability.
It contains all the properties collected for that vulnerability. The
primary key id is a foreign key that refers to the vulnerability table.
• grype potential affection. The table defines the affection relationship
between a package and a vulnerability. This affection relation belongs to
the grype set of vulnerabilities (defined in subsection 4.1). The primary
key is the (vulnerability, package) pair, where the vulnerability field is a
foreign key that refers to the vulnerability table and the package field is
a foreign key that refers to the package table.
• grype cpe potential affection. The table defines the affection rela-
tionship between a package and a vulnerability. This affection relation
belongs to the grype cpe set of vulnerabilities (defined in subsection 4.1).
The primary key is the (vulnerability, package) pair, where the vulnera-
bility field is a foreign key that refers to the vulnerability table and the
package field is a foreign key that refers to the package table.
• osv api potential affection. The table defines the affection relation-
ship between a package and a vulnerability. This affection relation be-
longs to the osv api set of vulnerabilities (defined in subsection 4.1). The
primary key is the (vulnerability, package) pair, where the vulnerability
field is a foreign key that refers to the vulnerability table and the package
field is a foreign key that refers to the package table.
Note that the defined structure makes it possible to handle the fact
that a vulnerability may be associated with more than one (package manager,
package name, package version) tuple. In detail, the tables vulnerability and
package represent a single vulnerability and a single (package name, package
version) pair respectively. The potential affection tables, on the other hand,
represent the relationship between a vulnerability and a pair, where a vulner-
ability may affect several pairs.
Take as an example the case of a vulnerability $v$ collected with Grype (without
the add-cpes-if-none parameter) for different versions of a package $x$. In
this case:
• Assume that, after the execution of the sbom-tool over several repositories,
a record (package manager, package name, package version) for
each version $x_1, \dots, x_n$ of $x$ has been inserted in the table package. Let
these records be called $\tilde{x}_1, \dots, \tilde{x}_n$.
• Assume that, after the execution of Grype over the SBoMs built in
the previous step, the vulnerability $v$ was found with Grype for all the
pairs $\tilde{x}_1, \dots, \tilde{x}_n$. A record for the vulnerability is inserted in the table
vulnerability. Let this record be called $\tilde{v}$.
• A record is inserted in the grype potential affection table for each pair
$(\tilde{v}, \tilde{x}_i)$, for each $i \in \{1, \dots, n\}$.
Table                            Cardinality
organization                     1 315
repository                       75 407
contributor                      33 718
user                             13 685
package                          62 758
vulnerability                    3 762
manifest dependency              559 586
parsed dependency                6 649
grype cpe potential affection    15 300
osv api potential affection      11 209
grype potential affection        9 989
vulnerability metadata           3 633
Table 2: Cardinalities of the tables of the database.
Table 2 shows the cardinality of the database tables. While the organization
and repository tables contain entries from all GitHub organizations in the
GitHub and Government list (see subsection 3.1), all other tables contain
data from the IT, DE, UK, US datasets only.
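The relationship between the package, vulnerability and potential affection tables can be sketched with an in-memory SQLite database. Column names, the identifier V-EXAMPLE and the package x are hypothetical; only the keys described in this subsection are modelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE package (
    purl TEXT PRIMARY KEY,
    package_manager TEXT, package_name TEXT, version TEXT
);
CREATE TABLE vulnerability (id TEXT PRIMARY KEY);
CREATE TABLE grype_potential_affection (
    vulnerability TEXT REFERENCES vulnerability(id),
    package TEXT REFERENCES package(purl),
    PRIMARY KEY (vulnerability, package)
);
""")

# One record per version of the hypothetical package x.
versions = ["1.0.0", "1.1.0", "2.0.0"]
conn.executemany(
    "INSERT INTO package VALUES (?, ?, ?, ?)",
    [(f"pkg:pypi/x@{v}", "pypi", "x", v) for v in versions],
)
# A single record for the vulnerability, however many versions it affects.
conn.execute("INSERT INTO vulnerability VALUES (?)", ("V-EXAMPLE",))
# One affection record per (vulnerability, package) pair.
conn.executemany(
    "INSERT INTO grype_potential_affection VALUES (?, ?)",
    [("V-EXAMPLE", f"pkg:pypi/x@{v}") for v in versions],
)

affected = conn.execute(
    "SELECT COUNT(*) FROM grype_potential_affection WHERE vulnerability = ?",
    ("V-EXAMPLE",),
).fetchone()[0]
print(affected)
```

The vulnerability is stored once, while the affection table grows with the number of affected package versions.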
4.3 Creation of SBoMs: critical issues
4.3.1 Conditional dependencies
Dependencies for a software artifact often have to be specified in a conditional
way, that is, depending on specific properties of the target environment in
which the artifact will be built or executed. The rules for specifying such con-
ditional dependencies usually depend on the ecosystem. Two examples related
to the PyPI and Maven ecosystems are provided below.
Example 1 The requirements.txt dependencies file for PyPi (Python), as per
PEP 508 [31], allows the following notation for conditional dependencies:
requirements.txt
numpy==1.20.3; python_version<"3.6" and sys_platform!="linux"
In this case, version 1.20.3 of the numpy package is required only if the envi-
ronment is not Linux and the version of Python used is lower than 3.6.
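The evaluation of such a condition can be illustrated with a deliberately small evaluator. It is a sketch, not a PEP 508 implementation: it supports only simple comparisons joined by "and", which is enough for the line above, and all function names are hypothetical. A complete implementation is available in the packaging library.

```python
import re

CLAUSE = re.compile(r'\s*(\w+)\s*(==|!=|<=|>=|<|>)\s*"([^"]*)"\s*')


def version_key(text):
    """Turn '3.10' into (3, 10) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def clause_holds(clause, env):
    match = CLAUSE.fullmatch(clause)
    if match is None:
        raise ValueError(f"unsupported marker clause: {clause!r}")
    variable, operator, expected = match.groups()
    actual = env[variable]
    if variable == "python_version":
        actual, expected = version_key(actual), version_key(expected)
    comparisons = {
        "==": actual == expected, "!=": actual != expected,
        "<": actual < expected, "<=": actual <= expected,
        ">": actual > expected, ">=": actual >= expected,
    }
    return comparisons[operator]


def is_required(line, env):
    """Return True if the dependency on this line applies to env."""
    _, _, marker = line.partition(";")
    if not marker.strip():
        return True  # unconditional dependency
    return all(clause_holds(clause, env) for clause in marker.split(" and "))


LINE = 'numpy==1.20.3; python_version<"3.6" and sys_platform!="linux"'
print(is_required(LINE, {"python_version": "3.5", "sys_platform": "win32"}))
print(is_required(LINE, {"python_version": "3.10", "sys_platform": "linux"}))
```

Comparing versions as integer tuples matters here: as plain strings, "3.10" would sort before "3.6".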
Example 2 The specification of Project Object Model pom.xml file for Maven
(Java) [32] defines build profiles to handle equivalent but different parameters
for a set of target environments. Below is an example of a part of a pom.xml
file that uses build profiles.
pom.xml
<dependencies>
  <dependency>
    <groupId>com.google.cloud</groupId>
    <artifactId>google-cloud-logging</artifactId>
    <version>3.12.1</version>
  </dependency>
</dependencies>
<profiles>
  <profile>
    <id>profile_1</id>
    <dependencies>
      <dependency>
        <groupId>com.sun.jersey</groupId>
        <artifactId>jersey-json</artifactId>
      </dependency>
    </dependencies>
  </profile>
  <profile>
    <id>profile_2</id>
    <dependencies>
      <dependency>
        <groupId>io.spray</groupId>
        <artifactId>spray-json_3</artifactId>
      </dependency>
    </dependencies>
  </profile>
</profiles>
In this case, a global dependency (google-cloud-logging) and two build profiles
are defined in the pom.xml file. Depending on the profile used, the required
dependency will either be jersey-json (profile_1) or spray-json_3 (profile_2).
When handling conditional dependencies, sbom-tool (more precisely, the
component-detection tool) adopts a predefined behavior that depends on the
ecosystem and, most importantly, does not record the conditions themselves in
the generated SBoM. With reference to the above examples:
• In handling the requirements.txt file of Example 1, sbom-tool ignores the
dependency conditions, and adds the numpy package to the dependen-
cies, irrespective of the current Python version or the current system
platform.
• In handling the pom.xml file of Example 2, sbom-tool ignores the pres-
ence of build profiles, adding only the google-cloud-logging package to
dependencies.
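The information lost in the second case can be made visible with a short sketch that separates global dependencies from profile-specific ones. The fragment of Example 2 is wrapped in a project root element without XML namespace, which is a simplification of this sketch (real pom.xml files declare a namespace); the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

POM = """
<project>
  <dependencies>
    <dependency>
      <groupId>com.google.cloud</groupId>
      <artifactId>google-cloud-logging</artifactId>
      <version>3.12.1</version>
    </dependency>
  </dependencies>
  <profiles>
    <profile>
      <id>profile_1</id>
      <dependencies>
        <dependency>
          <groupId>com.sun.jersey</groupId>
          <artifactId>jersey-json</artifactId>
        </dependency>
      </dependencies>
    </profile>
    <profile>
      <id>profile_2</id>
      <dependencies>
        <dependency>
          <groupId>io.spray</groupId>
          <artifactId>spray-json_3</artifactId>
        </dependency>
      </dependencies>
    </profile>
  </profiles>
</project>
"""


def dependency_names(parent):
    """List group/artifact names declared directly under an element."""
    return [
        f"{dep.findtext('groupId')}/{dep.findtext('artifactId')}"
        for dep in parent.findall("./dependencies/dependency")
    ]


root = ET.fromstring(POM)
global_deps = dependency_names(root)
profile_deps = {
    profile.findtext("id"): dependency_names(profile)
    for profile in root.findall("./profiles/profile")
}
print(global_deps)
print(profile_deps)
```

Only the content of global_deps ends up in the generated SBoM; the entries of profile_deps are the conditional dependencies that are left out.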
The fact that the generated SBoM does not describe conditional dependencies
could be a limitation of the SBoM standard. As per the NTIA guidelines [33],
a SBoM could be created for each software component at the moment it is
built, packaged or delivered. This implies that software delivered as
source code may have a SBoM. Moreover, software delivered as source code
may have conditional dependencies, which should then be stored in the SBoM.

It is not clear how information on conditional dependencies should be stored
when creating SBoMs that follow standards such as CycloneDX and SPDX.
More precisely, to the best of the author’s understanding:
• The CycloneDX standard has no specific attributes for the insertion of
conditional dependencies. This information can be inserted into the
SBoM only as a non-machine readable comment.
• The SPDX 2.2 standard has no specific attributes for the insertion of
conditional dependencies. This information can be inserted into the
SBoM only as a non-machine readable comment. Moreover from the
SPDX 2.2 documentation [34], it is possible to define relationships be-
tween the software and its dependencies. The closest relationship type
that can represent conditional dependencies is the type of optional de-
pendencies. Dependencies are optional ones only when building the code
will proceed even without them. The description of the relationship type
OPTIONAL_DEPENDENCY_OF says:
Use when building the code will proceed even if a dependency
cannot be found, fails to install, or is only installed on a specific
platform.
It is noted, however, that although there could be conditional depen-
dencies that are optional ones, conditional and optional dependencies do
not represent the same concept.
It is also worth noting that the SPDX documentation concerning the attribute
“External reference comment field” differs slightly between the web [34] and
pdf [35] versions of the documentation. In particular, the pdf version uses the
term “conditional” to describe the field Cardinality, while it would be more
correct to use the term “optional”.
4.3.2 Version constraints
The structure and stringency of the files that store dependencies depend on
the ecosystem. For example, the requirements.txt file for PyPi allows a depen-
dency to be added without specifying its version. In this case, the dependency
is collected by the sbom-tool using the latest version released on PyPi as the
version (this is also the case when transitive dependencies are obtained by
querying PyPi’s index). In most other cases, the specifications on manifest
files structure are much more stringent. In particular, manifest files can be
divided into two groups:
• Files that allow dependencies to be specified without fixing a version,
using specific syntaxes (e.g. >=, caret, tilde) to define specific version
ranges based on the semantic versioning system.
• Files that require the version of each dependency to be specified; these
include the lock file category: files that contain a snapshot of all the
dependencies used (both direct and transitive) with their versions.
Dependencies are said to be locked.
In general, the SBoM concept requires dependencies to be locked. Given that,
a manifest file with dependencies of the form >= could lead to a SBoM where
the versions of the dependencies are the latest versions released at the time
the SBoM was built. However, this may not be the appropriate behaviour:
it would be more meaningful to store the last version at the time of the last
update of the software artifact (e.g. the last commit of a GitHub repository
or the date of last publication for a package).
Now take the following example concerning the creation of a SBoM for a
GitHub repository.
Example 1 Repository X is created on 17 May 2022. It contains some code
files and the requirements.txt file, with the following content:
requirements.txt
tensorflow-gpu >= 2.9.0
The author uses tensorflow-gpu package with the latest version released, 2.9.0
(released on 16 May 2022). The author no longer updates the repository.
The sbom-tool is executed on 7 September 2022. The tensorflow-gpu
package is collected with the latest version released, 2.10.0 (released on 6
September 2022).
In this case, it is not possible to find the CVE-2022-36011 vulnerability, which
affects version 2.9.0 but not 2.10.0 of the package tensorflow-gpu.
In this case, what is missing is a standardized procedure for constructing
SBoMs for a given GitHub repository.
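The alternative behaviour suggested above, i.e. resolving a >= constraint at the date of the last update of the artifact, can be sketched as follows. The release list contains only the two releases mentioned in Example 1, and the function names are hypothetical.

```python
from datetime import date

# Only the two releases mentioned in Example 1.
RELEASES = {
    "2.9.0": date(2022, 5, 16),
    "2.10.0": date(2022, 9, 6),
}


def version_key(version):
    """Turn '2.10.0' into (2, 10, 0) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def resolve_at_least(releases, minimum, as_of):
    """Latest version >= minimum among those released on or before as_of."""
    candidates = [
        version for version, released in releases.items()
        if released <= as_of and version_key(version) >= version_key(minimum)
    ]
    return max(candidates, key=version_key) if candidates else None


last_commit = date(2022, 5, 17)   # creation and last update of repository X
scan_date = date(2022, 9, 7)      # execution of sbom-tool

print(resolve_at_least(RELEASES, "2.9.0", last_commit))
print(resolve_at_least(RELEASES, "2.9.0", scan_date))
```

Resolving at the last commit date yields 2.9.0, the version actually used by the author, while resolving at the scan date yields 2.10.0.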
4.4 Vulnerabilities collection: critical issues
4.4.1 Grype and Java vulnerabilities
A first problem noted when using Grype was the non-identification of
vulnerabilities for Java packages. Although such vulnerabilities exist, none
related to the Java packages reported in the SBoMs built with sbom-tool were
found when running Grype.
This could be a compatibility problem between the SPDX SBoMs generated
with sbom-tool and Grype. In fact, the Grype documentation states:
Grype supports input of Syft, SPDX, and CycloneDX SBoM for-
mats. If Syft has generated any of these file types, they should
have the appropriate information to work properly with Grype. It
is also possible to use SBoMs generated by other tools with varying
degrees of success.

Interestingly, Java packages are the only packages collected for which all pack-
age names are in the form string/identifier, where string is the reverse
domain string of the organization that authored the package. For example:
package name = org.springframework/spring-tx
Although Grype seems to not be able to find Java vulnerabilities from SBoMs
constructed by sbom-tool, by manually querying the Grype database using
the full package names many Java vulnerability matches were found.
Given that, it is assumed that Grype is not able to extract the full name of a
Java package from the purl (package url) stored in a SBoM built by sbom-tool.
For the grype and grype cpe vulnerability sets (see subsection 4.1), it is there-
fore decided to add the vulnerabilities related to Java packages that are present
in the database used by Grype without the direct use of Grype. In detail, the
following steps are executed:
• Obtain the list l of all Java packages stored in the table package of the
database described in subsection 4.2.
• For each (package name, package version) Java pair p = (pn, pv) ∈ l,
search in the Grype database for Java vulnerabilities related to p. For
each vulnerability v found:
– Store v in the table vulnerability of the database described in sub-
section 4.2.
– Store the affection relation between vulnerability v and package p in
both tables grype potential affection and grype cpe potential affection
of the database described in subsection 4.2.
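The steps above can be sketched as a loop over the Java packages. The lookup function stands in for the search in the Grype database, whose actual interface is not described here; the index, the identifiers and the versions below are placeholders and not real data.

```python
def collect_java_affections(java_packages, find_vulnerabilities):
    """Collect vulnerabilities and affections for (name, version) pairs.

    Each affection is stored in both the grype and the grype cpe sets.
    """
    vulnerability = set()
    grype_potential_affection = set()
    grype_cpe_potential_affection = set()
    for package in java_packages:
        for vuln_id in find_vulnerabilities(package):
            vulnerability.add(vuln_id)
            grype_potential_affection.add((vuln_id, package))
            grype_cpe_potential_affection.add((vuln_id, package))
    return (vulnerability, grype_potential_affection,
            grype_cpe_potential_affection)


# Placeholder index standing in for the Grype database (not real data).
HYPOTHETICAL_INDEX = {
    ("org.springframework/spring-tx", "0.0.0-example"):
        ["VULN-EXAMPLE-1", "VULN-EXAMPLE-2"],
}
packages = [
    ("org.springframework/spring-tx", "0.0.0-example"),
    ("org.example/unaffected", "1.0.0"),
]
vulns, grype_rows, grype_cpe_rows = collect_java_affections(
    packages, lambda package: HYPOTHETICAL_INDEX.get(package, [])
)
print(sorted(vulns))
```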
4.4.2 Grype false positive
Another problem encountered when using Grype concerns the use of the
add-cpes-if-none parameter. As described in subsection 4.1, the grype cpe
vulnerability set is obtained by using Grype with that parameter. Below is the
description of the add-cpes-if-none parameter in the Grype documentation:
Two things that make Grype matching more successful are inclu-
sion of CPE and Linux distribution information. If an SBoM does
not include any CPE information, it is possible to generate these
based on package information using the add-cpes-if-none flag.
The problem regarding Grype matching with add-cpes-if-none is that sev-
eral erroneous vulnerability-package associations were found. In these false
positives, the pattern is that a certain vulnerability related to a generic soft-
ware artifact with name X from ecosystem Y1 is associated with a package
with the same name X but from ecosystem Y2. Some examples are reported
below.
Example 1 Grype associates vulnerability CVE-2020-7791 with RubyGems
package i18n (Ruby) when the vulnerability is actually associated with the
package i18n for ASP.NET.
Example 2 Grype associates vulnerability CVE-2020-18032 with PyPi pack-
age graphviz (Python) when the vulnerability is actually associated with the
software Graphviz Graph Visualization Tools.
Example 3 Grype associates vulnerability CVE-2020-15133 with npm pack-
age faye-websocket (JavaScript) when the vulnerability is actually associated
with the RubyGems package faye-websocket (Ruby).
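The pattern behind these false positives can be illustrated by comparing a match on the package name alone with a match on the (ecosystem, name) pair. This is not Grype's matching algorithm, only a sketch of the pattern, built on the data of Example 3; the record layout and function names are assumptions.

```python
# Advisory data taken from Example 3; the record layout is hypothetical.
ADVISORIES = [
    {"id": "CVE-2020-15133", "ecosystem": "RubyGems", "name": "faye-websocket"},
]


def match_by_name(package, advisories):
    """Name-only matching: ignores the ecosystem of the package."""
    return [a["id"] for a in advisories if a["name"] == package["name"]]


def match_by_ecosystem_and_name(package, advisories):
    """Stricter matching: the ecosystem must agree as well."""
    return [
        a["id"] for a in advisories
        if a["name"] == package["name"]
        and a["ecosystem"] == package["ecosystem"]
    ]


npm_package = {"ecosystem": "npm", "name": "faye-websocket"}
print(match_by_name(npm_package, ADVISORIES))
print(match_by_ecosystem_and_name(npm_package, ADVISORIES))
```

The name-only match reports the vulnerability for the npm package, reproducing the erroneous association, while the stricter match reports nothing.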
Taking this problem into account (and assuming there may be other false
positives), the author of this work decided to use a limited set of vulnerabil-
ities for the analysis in the next section. In detail, it is chosen to use all the
vulnerabilities present in all 3 vulnerability sets: grype, grype cpe and osv api.
(a) Vulnerabilities (b) Affections
Figure 12: Venn diagrams specifying how many vulnerabilities (a) and affections
(b) are present in which set.
Figure 12 shows the Venn diagrams of the number of vulnerabilities and af-
fections (vulnerability-dependency pairs) for the 3 sets of vulnerabilities. The
number of vulnerabilities that will be considered later will therefore be 1 778.
These 1 778 vulnerabilities produce a total of 9 020 affections.
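The selection of the vulnerabilities common to the 3 sets is a plain set intersection. In the sketch below the affections are (vulnerability, package) pairs with placeholder identifiers; the numbers are not those of Figure 12.

```python
# Placeholder (vulnerability, package) pairs for the 3 sets.
grype = {("V1", "a"), ("V2", "b"), ("V3", "c")}
grype_cpe = {("V1", "a"), ("V2", "b"), ("V3", "c"), ("V4", "d")}
osv_api = {("V1", "a"), ("V2", "b"), ("V5", "e")}

# Affections present in all 3 sets.
common_affections = grype & grype_cpe & osv_api

# Vulnerabilities present in all 3 sets.
common_vulnerabilities = (
    {v for v, _ in grype}
    & {v for v, _ in grype_cpe}
    & {v for v, _ in osv_api}
)

print(sorted(common_affections))
print(sorted(common_vulnerabilities))
```

An affection found only with the add-cpes-if-none parameter, such as ("V4", "d") here, is discarded, which is how the false positives of this subsection are kept out of the analysis.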
4.5 Dependency analyses
The analyses of the dependencies found are now presented. As specified in
subsection 4.1, dependencies are collected for the 4 datasets IT, DE, UK and
US described in subsection 3.2.

4.5.1 Packages and dependencies: Viewpoints of package managers,
datasets and repositories
(a) IT (b) DE
(c) UK (d) US
Figure 13: Distribution of (package name, package version) pairs among the
various package managers for datasets IT,DE,UK and US.
Figure 13 shows, for each dataset, the distribution of all the (package name,
package version) pairs found among the various package managers.
It can be seen that, although JavaScript is the most common language only for
the repositories of dataset DE (see Figure 9), in all datasets the vast
majority of packages come from npm. This large number of JavaScript packages
is not surprising: JavaScript projects are known for their large number of
dependencies. This is mainly due to the absence of a standard library and the
resulting intensive use of third-party packages by developers. This has
created a very large ecosystem: the packages registered on npm outnumber all
the packages of Maven, PyPi and RubyGems put together [36].
Moreover, the Node.js dependency management mechanism allows different
versions of the same package to coexist within the same project, increasing
the number of possible dependencies.
(e) IT (f) DE
(g) UK (h) US
Figure 14: Distribution of (package name, package version, repository) tuples
among the various package managers for datasets IT, DE, UK and US.
Figure 14 shows, for each dataset, the distribution of all the (package name,
package version, repository) tuples found among the various package man-
agers.
Here, npm is even more prominent: npm packages are on average used by more
repositories of the same dataset than packages from another package manager.
Interestingly, the UK dataset seems to be the one with the lowest percentage
of npm dependencies compared to total dependencies. This is due to the large
number of RubyGems dependencies present; Ruby is in fact the most popular
language in the UK dataset.
Furthermore, the distribution of all the dependencies found, i.e. (repository,
package) pairs, among the various datasets is reported:
• 13.04% of dependencies found come from dataset IT (an average of 103
dependencies per repository).
• 40.48% of dependencies found come from dataset DE (an average of 176
dependencies per repository).
• 30.78% of dependencies found come from dataset UK (an average of 112
dependencies per repository).
• 15.69% of dependencies found come from dataset US (an average of 95
dependencies per repository).
Datasets
Repos. lang. IT DE UK US
Python 14 6 14 9
JavaScript 12 18 26 27
Ruby 1 0 60 0
Go 0 1 18 2
Java 2 5 8 1
(a)
Datasets
Repos. lang. IT DE UK US
Python 12 6 14 6
JavaScript 10 10 12 23
Ruby 1 0 59 0
Go 0 0 9 0
Java 2 1 2 1
(b)
Table 3: Cross-language dependencies: For each language, number of repositories
with at least a dependency of a different language (Table (a)) and with at least a
dependency of that language and of a different language (Table (b)).
A cross-language dependency for a repository is defined as a dependency of a
different language from the one of the repository.
Table 3 shows the number of repositories with at least one cross-language
dependency (3-a) and with at least one cross-language dependency and one
dependency of the repository’s language (3-b).
In some of the subsequent dependency analyses, the repositories are divided by
language; this division does not take into account the cross-language
dependencies of these repositories. However, from 3-a, it can be seen that the
number of repositories with at least one cross-language dependency is small
compared to the cardinalities of the datasets.
Finally, it can be seen that the data in 3-b are very similar to those in 3-a,
indicating that, when a repository has cross-language dependencies, it often
also has dependencies of its own language.
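The two counts of Table 3 follow directly from the definition of cross-language dependency. The sketch below illustrates the classification of a single repository; the mapping from package manager to language is an assumption made for this example and the function name is hypothetical.

```python
# Assumed mapping from package manager to language (illustrative only).
PM_LANGUAGE = {
    "npm": "JavaScript",
    "pypi": "Python",
    "gem": "Ruby",
    "maven": "Java",
    "golang": "Go",
}


def cross_language_flags(repo_language, dependency_managers):
    """Return (counts for Table 3-a, counts for Table 3-b) for one repository.

    3-a: at least one dependency of a language different from the repository's.
    3-b: as 3-a, and also at least one dependency of the repository's language.
    """
    dep_languages = {
        PM_LANGUAGE[pm] for pm in dependency_managers if pm in PM_LANGUAGE
    }
    has_cross = any(lang != repo_language for lang in dep_languages)
    has_own = repo_language in dep_languages
    return has_cross, has_cross and has_own


print(cross_language_flags("Ruby", ["gem", "npm"]))
print(cross_language_flags("Python", ["npm"]))
print(cross_language_flags("Go", ["golang"]))
```

Summing the first flag over the repositories of a language gives the entries of Table (a), and summing the second gives those of Table (b).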
Figure 15: On the y-axis, the number of dependencies. On the x-axis, the
percentage of repositories of a target language that have at least y dependencies.
Languages taken into account are JavaScript, Java, Python, Ruby, Go.
Figure 15 relates, for each target language, the number of dependencies that
come from the package manager of that language to the percentage of
repositories of that language that have at least that number of dependencies.
The languages taken into account are the 5 languages with the highest number
of dependencies found across the 4 datasets. For example, a point (x,y) on the
Ruby curve means that x% of Ruby repositories have at least y Ruby
dependencies.
Once again, the anomalous behaviour of JavaScript with respect to the other
languages can be seen. In particular, it can be seen that almost 20% of the
JavaScript repositories have more than 1 000 dependencies.
Python, on the other hand, seems to be the language with the smallest number
of dependencies per repository, with only about 2% of Python repositories
having more than 100 dependencies.
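Each point of the curves in Figure 15 can be obtained by counting the repositories that reach a given number of dependencies. A minimal sketch follows, with invented counts and a hypothetical function name.

```python
def share_with_at_least(dependency_counts, threshold):
    """Percentage of repositories with at least `threshold` dependencies."""
    if not dependency_counts:
        return 0.0
    hits = sum(1 for count in dependency_counts if count >= threshold)
    return 100.0 * hits / len(dependency_counts)


# Invented per-repository dependency counts for one language.
counts = [3, 12, 150, 1200, 40]

# Points (x, y) of the curve: x% of repositories have at least y dependencies.
curve = [(share_with_at_least(counts, y), y) for y in (1, 10, 100, 1000)]
print(curve)
```

By construction the percentage can only decrease as the threshold grows, which matches the shape of the curves in the figure.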
4.5.2 Critical dependencies
Figure 16 shows, for each dataset, the top 30 (package name, package version)
pairs ranked by the percentage of repositories that depend on them.
The first thing that can be seen is that almost all packages are JavaScript
packages for all 4 datasets. Since information about dependency trees is not
collected, it is not possible to distinguish between direct and transitive
dependencies in this study. However, it can be assumed that many of these
packages are present as transitive dependencies. This assumption is motivated
by the data obtained in Table 4.
(a) IT
(b) DE
(c) UK
(d) US
Figure 16: Top 30 (package name, package version) pairs by percentage of
repositories that depend on them for the datasets IT, DE, UK, US.
Table 4 contains a set of properties for the top 10 JavaScript (package name,
package version) pairs ranked by the number of repositories that depend on
them. The repositories taken into account are the set of repositories of all
4 datasets. It can be seen that most of the packages in the top 10 of Table 4
are also present in the top 30 of each dataset (Figure 16). In detail, the
properties specified by Table 4 are: the percentage of repositories that
depend on the package, the number of dependent packages, the number of
dependencies and the latest existing version.
Name                          Repos. (%)   Dependent   Dependencies   L. Version
wrappy 1.0.2                  10.40        1 159       0              1.0.2
once 1.4.0                    10.35        2 175       1              1.4.0
concat-map 0.0.1              10.26        1 149       0              0.0.2
inflight 1.0.6                10.22        1 124       2              1.0.6
path-is-absolute 1.0.1        10.20        1 513       0              2.0.0
fs.realpath 1.0.0             10.18        1 109       0              1.0.0
util-deprecate 1.0.2          9.58         947         0              1.0.2
escape-string-regexp 1.0.5    9.49         3 217       0              5.0.0
isexe 2.0.0                   9.40         971         0              2.0.0
isarray 1.0.0                 9.29         1 060       0              2.0.5
Table 4: Top 10 JavaScript (package name, package version) pairs by number of
repositories that depend on them. The repositories taken into account are the set of
repositories of all 4 datasets. For each package, the table reports the percentage of
repositories that depend on it, the number of dependent packages, the number of
dependencies and the latest existing version.
From Table 4 it can be noted that 4 out of 10 packages are not updated to
the latest version. Among the 10 packages in the table, there is one, path-is-
absolute, that is deprecated. As can be seen from Figure 16, the deprecated
package path-is-absolute is used by 8.41%, 18.48%, 11.11%, 9.61% of all the
repositories of the datasets IT, DE, UK, US respectively.
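The count of outdated packages can be reproduced from Table 4 by comparing the version in use with the latest existing version. The sketch below copies both columns from the table; the variable names are chosen for this example only.

```python
# (package name, version in use) -> latest existing version, from Table 4.
TABLE_4 = {
    ("wrappy", "1.0.2"): "1.0.2",
    ("once", "1.4.0"): "1.4.0",
    ("concat-map", "0.0.1"): "0.0.2",
    ("inflight", "1.0.6"): "1.0.6",
    ("path-is-absolute", "1.0.1"): "2.0.0",
    ("fs.realpath", "1.0.0"): "1.0.0",
    ("util-deprecate", "1.0.2"): "1.0.2",
    ("escape-string-regexp", "1.0.5"): "5.0.0",
    ("isexe", "2.0.0"): "2.0.0",
    ("isarray", "1.0.0"): "2.0.5",
}

# A pair is outdated when the version in use differs from the latest one.
outdated = sorted(
    name for (name, used), latest in TABLE_4.items() if used != latest
)
print(len(outdated), outdated)
```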
4.5.3 Manifest vs Parsed dependencies
           JavaScript               Python                   Ruby     Java     Go
Dataset    man.    par.    tot.     man.    par.    tot.     man.     man.     man.
IT         5.93    3.14    7.33     10.77   16.84   22.56    0.99     10.42    4.26
DE         27.4    9.95    30.89    11.62   12.76   17.51    0.79     24.31    2.13
UK         22.34   6.28    24.26    20.71   16.16   25.08    54.96    9.72     36.88
US         12.74   7.16    15.18    6.73    6.9     10.77    5.75     0.69     11.35
Table 5: Percentages (%) of JavaScript, Python, Ruby, Java and Go repositories
for which at least one manifest dependency (man), parsed dependency (par) and any
dependency (tot) is found.
In Table 5 the percentages of JavaScript, Python, Ruby, Java and Go repos-
itories for which at least one dependency is found are presented. Due to the
way dependencies are collected (see subsection 4.1), the database constructed
contains parsed dependencies only for JavaScript and Python.
It is observed that a repository that contains code could have no dependencies
of the manifest dependency category for one of the following reasons:
• No third-party dependencies are used.
• The third-party dependencies used are not specified in the dependency
files.
• Dependency files are not present or are malformed.
Table 5 shows that the highest percentage of repositories with at least one de-
pendency found in the dependency files for JavaScript repositories is obtained
by the DE dataset, with more than 27% of JavaScript repositories for which at
least one dependency is found. The DE dataset also has the highest percentage
of JavaScript repositories for which at least one parsed dependency is found.
For the IT dataset, on the other hand, at least one manifest dependency is
found for less than 6% of the JavaScript repositories.
The Python repositories in the IT dataset seem to have the highest percentage
of repositories with dependencies not specified in the dependency files (this
is the case for more than 16% of the Python repositories in the dataset).
Finally, it is noted that the UK dataset has the highest percentage of Ruby
and Go repositories for which at least one dependency was found: 54.96% and
36.88% of Ruby and Go repositories respectively.
In addition, looking at the data in both Figure 16 and Figure 9, one can
see that:
• 10.52% of the repositories in the IT dataset are JavaScript or TypeScript
repositories. More than 8% of the repositories in the IT dataset have at
least one npm dependency.
• 23.92% of the repositories in the DE dataset are JavaScript or TypeScript
repositories. More than 18% of the repositories in the DE dataset have at
least one npm dependency.
• 12.01% of the repositories in the UK dataset are JavaScript or TypeScript
repositories. More than 11% of the repositories in the UK dataset have at
least one npm dependency.
• 12.80% of the repositories in the US dataset are JavaScript or TypeScript
repositories. More than 9% of the repositories in the US dataset have at
least one npm dependency.
However, Table 5 shows that for the majority of the JavaScript repositories of
all the datasets, no dependencies are found: this indicates that many of the
repositories that depend on packages in Figure 13 are not JavaScript reposi-
tories.
Finally, the data in Table 6 are reported. Table 6 contains the percentage of
repositories that have no manifest dependencies coming from a given package
manager, computed over the repositories that have at least one parsed
dependency coming from that package manager. These data are intended to answer
the question: in cases where the developers fail to specify at least one
dependency, do they omit just some of them, or all of them?
The results show that the IT dataset has the highest share of repositories in
which no dependency is specified: for almost 70% of the repositories in the IT
dataset for which a parsed npm/pypi dependency was found, no npm/pypi
dependencies are specified in the manifest files.
Dataset
Package Manager IT DE UK US
npm 68.75 43.10 50.00 42.31
pypi 69.81 47.37 30.21 63.41
Table 6: Percentages of repositories that have no npm/pypi manifest dependencies
with respect to repositories that have at least one npm/pypi parsed dependency.
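The ratio reported in Table 6 can be sketched as follows for a single package manager. The input layout (a mapping from repository to its numbers of manifest and parsed dependencies) and the function name are assumptions made for this example, and the values are invented.

```python
def unspecified_share(repos):
    """Percentage of repositories without manifest dependencies, computed
    over the repositories that have at least one parsed dependency.

    `repos` maps a repository name to (n_manifest, n_parsed) for one
    package manager.
    """
    with_parsed = [name for name, (_man, par) in repos.items() if par > 0]
    if not with_parsed:
        return 0.0
    no_manifest = [name for name in with_parsed if repos[name][0] == 0]
    return 100.0 * len(no_manifest) / len(with_parsed)


# Invented example: "c" is excluded because it has no parsed dependency.
repos = {"a": (0, 3), "b": (5, 2), "c": (0, 0), "d": (4, 1), "e": (0, 2)}
print(unspecified_share(repos))
```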
4.6 Vulnerability analyses
The vulnerabilities found for all repositories of the four datasets IT, DE, UK
and US are now analysed.
In this work, a vulnerability associated with a repository is defined as a vul-
nerability associated with a package used by that repository. It is noted that
the fact that a package is vulnerable does not imply that a repository using
that package is vulnerable. It could in fact be that the repository does not use
the vulnerable units of the package. Moreover, the way a package is used may
make it impossible to exploit the vulnerability. It is therefore said that the
existence of a vulnerability associated with a repository makes the repository
potentially vulnerable.
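Under this definition, the vulnerabilities associated with a repository are the union of the vulnerabilities of the packages it uses. A minimal sketch with placeholder names and a hypothetical function:

```python
def repository_vulnerabilities(repo_packages, package_vulns):
    """Map each repository to the vulnerabilities of the packages it uses.

    `repo_packages` maps a repository to a set of (name, version) pairs;
    `package_vulns` maps a (name, version) pair to a set of vulnerability ids.
    """
    result = {}
    for repo, packages in repo_packages.items():
        vulns = set()
        for package in packages:
            vulns |= package_vulns.get(package, set())
        result[repo] = vulns
    return result


# Placeholder data, not taken from the datasets of this work.
repo_packages = {
    "repo-1": {("pkg-x", "1.0.0"), ("pkg-y", "2.1.0")},
    "repo-2": {("pkg-z", "0.3.0")},
}
package_vulns = {
    ("pkg-x", "1.0.0"): {"VULN-A", "VULN-B"},
    ("pkg-y", "2.1.0"): {"VULN-B"},
}

associated = repository_vulnerabilities(repo_packages, package_vulns)
print(associated)
```

The result says nothing about exploitability: it only lists the potential vulnerabilities of each repository.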
A total of 3 792 vulnerabilities were collected. Among these, 3 576 have a
CVE identifier (see subsection 4.2). A CVE identifier is an identifier
assigned to a publicly known vulnerability and published by the CVE (Common
Vulnerabilities and Exposures) system. Entities that can publish a CVE
record are called CVE Numbering Authorities (CNAs).
Severity is expressed through CVSS (Common Vulnerability Scoring System), an
open set of standards maintained by FIRST and used by many organizations such
as NVD, IBM or Oracle to assign a severity measure to a vulnerability. CVSS
assigns a severity score from 0 to 10 to a given vulnerability. This score
takes into account certain parameters such as the vulnerability’s
exploitability, the vulnerability’s scope and the impact of an attack in case
the vulnerability is successfully exploited. More information can be found in
the documentation [37].
Depending on the assigned score, a vulnerability is categorized into one of the
following severity classes:
• Low: severity in range 0.1-3.9.
• Medium: severity in range 4.0-6.9.
• High: severity in range 7.0-8.9.
• Critical: severity in range 9.0-10.0.
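The mapping from score to severity class can be written as a small function. This is a sketch based on the ranges listed above; the function name is chosen for this example, and the class "None" for a score of 0.0 follows the CVSS v3 qualitative scale rather than the list above.

```python
def severity_class(score: float) -> str:
    """Return the severity class of a CVSS base score."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS score out of range: {score}")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


for score in (3.9, 4.0, 6.9, 7.0, 8.9, 9.0, 10.0):
    print(score, severity_class(score))
```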
All vulnerabilities taken into account that do not have a CVE identifier are
GitHub GHSA vulnerabilities (see subsection 4.2). GitHub is a CNA, and
usually provides CVSS base metrics for vulnerabilities in its database, even
if a CVE identifier has not yet been assigned to them. In detail, all
collected GitHub GHSA vulnerabilities have a severity class.
Figure 17 presents the distribution of severities among all the
vulnerabilities taken into account.
It can be seen that more than 40% of vulnerabilities have high severity. The
second most present severity class is medium, followed by critical and low.
Figure 17: Distribution of severities among all the vulnerabilities
found for the 4 datasets IT, DE, UK and US.
4.6.1 Viewpoint of repositories
Now vulnerabilities found are analysed from the viewpoint of the repositories.
Figure 18: Percentage of repositories that have at least one high or critical
vulnerability for datasets IT, DE, UK, US.
Figure 18 shows the percentage of repositories associated with at least one high
or critical vulnerability for the 4 datasets. It can be seen that the UK dataset
has the highest percentage: almost 40% of the repositories in the dataset use
a package for which a vulnerability with a score of 7 or higher is found.
Figure 19 shows the distributions of the number of critical and high
vulnerabilities found in each repository; percentage values are computed with
respect to the number of repositories with at least one high or critical
vulnerability (see Figure 18).
It can be seen that, for all datasets, most of the repositories fall in the
ranges (0,2) critical and (0,4) high vulnerabilities. Focusing on the ranges
with more than 11 critical and more than 44 high vulnerabilities (cells in the
upper right region of each map), IT has the highest percentage of
repositories: 22.68% of the repositories taken into account, as opposed to
10.12%, 7.68% and 4.68% for datasets DE, UK and US respectively. Moreover,
1.68% of the Italian repositories taken into account are associated with 18 or
more critical vulnerabilities and 70 or more high vulnerabilities (the cell in
the upper right corner of the map).
(a) IT
(b) DE
(c) UK
(d) US
Figure 19: Heat maps of the number of high and critical vulnerabilities. The value
in each square is the percentage of repositories associated with a number of high and
critical vulnerabilities in those ranges. Percentage values are computed only over the
repositories associated with at least one high or critical vulnerability (see
Figure 18).
An unresolved vulnerability v is defined as a vulnerability for which no version
of the vulnerable package without v has been released.
Now only the subset of unresolved vulnerabilities (at the data collection date,
October 11th, 2022) is analysed. Among all the vulnerabilities taken into
account, only 14.46% are marked as unresolved.
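A literal reading of the definition of unresolved vulnerability can be sketched with two collections of versions. This is only an illustration: the function name and inputs are hypothetical, versions are compared as plain strings, and it does not describe how the unresolved mark was actually obtained in this work.

```python
def is_unresolved(released_versions, affected_versions):
    """True when every released version of the package is affected,
    i.e. no version without the vulnerability has been released."""
    return set(released_versions) <= set(affected_versions)


# Placeholder version strings, not taken from a real package.
print(is_unresolved(["1.0", "1.1"], ["1.0", "1.1"]))
print(is_unresolved(["1.0", "1.1", "1.2"], ["1.0", "1.1"]))
```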
Figure 20 shows the percentage of repositories associated with at least one
unresolved high or critical vulnerability for the 4 datasets. It can be seen
that, as in Figure 18, the UK dataset has the highest percentage. The results
in Figure 20 and Figure 18 differ by a few percentage points for all datasets,
indicating that most repositories associated with at least one critical or high
vulnerability are associated with an unresolved vulnerability.
Figure 20: Percentage of repositories associated with at least one unresolved high
or critical vulnerability for datasets IT, DE, UK, US.
Figure 21 shows the distributions of the number of unresolved critical and high
vulnerabilities found in each repository; percentage values are computed with
respect to the number of repositories with at least one unresolved high or
critical vulnerability (see Figure 20).
It can be seen that, for all datasets, more than 60% of the repositories fall
in the ranges (0,2) unresolved critical and (0,4) unresolved high
vulnerabilities.
Only the DE dataset has repositories associated with more than 5 unresolved
critical vulnerabilities: 1.44% of its repositories have between 6 and 8
unresolved critical vulnerabilities and between 10 and 14 unresolved high
vulnerabilities.
(a) IT
(d) US
Figure 21: Heat maps of the number of unresolved high and critical vulnerabilities.
The value in each square is the percentage of repositories that are associated with a
number of unresolved high and critical vulnerabilities in those ranges. Percentage
values are computed only over the repositories associated with at least one unresolved
high or critical vulnerability (see Figure 20).
The distribution of critical or high vulnerabilities among repositories is
then analysed. Figure 22 shows the number of critical or high vulnerabilities
against the percentage of repositories with at least that number of
vulnerabilities. Percentages are computed over the total number of
repositories of each dataset.
As the percentage of repositories decreases, the number of vulnerabilities in-
creases exponentially for each dataset. One can also see that while for the IT,
DE and UK datasets the percentage of repositories associated with more than
40 vulnerabilities is more or less 4%, for the US dataset it is less than 2%.
Figure 22: Number of critical or high vulnerabilities per percentage of repositories
with at least that number of vulnerabilities. Percentages are computed over the total
number of repositories of each dataset.
4.6.2 Viewpoint of packages
Now vulnerabilities found are analysed from the viewpoint of the packages.
Figure 23: Distribution of vulnerabilities found among the various package
managers.
Figure 23 shows the distribution of vulnerabilities found among the various
package managers.
It can be seen that most of the vulnerabilities come from pypi packages,
followed by npm, maven and gem. Less than 1% of the vulnerabilities come from
cargo or golang.
However, Figure 23 does not take into account the frequency of use of each
package manager. Table 7 shows mean and variance values of the number of
vulnerabilities per package for each package manager. It can be seen that,
although more npm vulnerabilities were found than maven and gem ones, on
average a maven or a gem package has more vulnerabilities than an npm package.
It can also be seen that the variance values of the distributions for pypi and
maven are the highest ones; this indicates that a few packages account for
many vulnerabilities.
In particular, the variance in pypi is even greater than 90. This is explained by
the presence of 3 (package name, package version) pypi pairs that are related
to the open source platform TensorFlow:
• (tensorflow, 1.14.0)
• (tensorflow, 2.4.0)
• (tensorflow-gpu, 1.14.0)
These packages provide Python APIs for using the TensorFlow platform, and
all the TensorFlow vulnerabilities (more than 300) are associated with each
of them. If these packages were removed from the analysis, the mean and the
variance of the number of vulnerabilities per pypi package would become 0.23
and 2.54 respectively.
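The effect of a few heavily affected pairs on the variance can be illustrated with invented numbers. The sketch below is not computed on the data of this work; it assumes the population variance, since the text does not state which estimator was used.

```python
from statistics import fmean, pvariance


def summary(counts):
    """Mean and population variance of the vulnerabilities per pair."""
    return fmean(counts), pvariance(counts)


# Invented distribution: 97 pairs without vulnerabilities and 3 pairs
# with 300 vulnerabilities each.
counts = [0] * 97 + [300] * 3

mean_all, var_all = summary(counts)
mean_rest, var_rest = summary([c for c in counts if c < 300])

print(mean_all, var_all)
print(mean_rest, var_rest)
```

Three pairs out of one hundred are enough to move the variance by several orders of magnitude, which is the same mechanism observed for the TensorFlow pairs.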
                   Vulnerable pairs   Vulnerabilities per pair
Package manager    %                  Average    Variance
pypi               7.35               0.51       92.86
npm                3.23               0.06       0.18
maven              8.23               0.31       6.28
gem                11.44              0.35       2.66
Table 7: Percentage of (package name, package version) pairs found to be vulnerable
(percentages are computed over the total number of packages found for that package
manager). Average and variance of the number of vulnerabilities per pair.
In addition, Table 7 gives the percentage of pairs (package name, package ver-
sion) with at least one vulnerability for each package manager. The percent-
ages are calculated over the total number of packages found for each package
manager.
It can be seen that the percentages just reported for npm and maven are simi-
lar to the data reported by Sonatype: according to their 2021 SSSC Report [7],
8.4% of the pairs (package, version) housed in the maven central and 2.2% of
those housed in npm contain at least one known vulnerability.
However, the SSSC Report states that only 0.5% of the (package, version) pairs
housed in the pypi repository contain at least one vulnerability: therefore,
the pypi packages used by the 4 datasets are, on average, much more vulnerable
than the packages housed in the pypi repository as a whole.
                        Severity
Total     >= Low           >= Medium        >= High          Critical
62 710    3 073 (4.90%)    3 004 (4.79%)    2 104 (3.35%)    637 (1.02%)
Table 8: Total number of (package name, package version) pairs found and number
of pairs found that have at least one vulnerability with severity at least low,
medium or high, or with critical severity.
Table 8 shows the number of (package name, package version) pairs found
with at least one vulnerability of severity at least low, medium or high, and
with at least one critical vulnerability.
Vulnerability ID Severity Package Pairs Repos.
CVE-2022-3517 High minimatch (npm) 6 443
CVE-2021-43809 High bundler (gem) 44 392
CVE-2021-44906 Critical minimist (npm) 10 374
GHSA-2qc6-mcvw-92cw Medium nokogiri (gem) 51 351
CVE-2020-28469 High glob-parent (npm) 5 303
CVE-2020-36327 High bundler (gem) 21 276
CVE-2021-3918 Critical json-schema (npm) 1 263
CVE-2022-30122 Medium rack (gem) 33 263
CVE-2022-30123 High rack (gem) 33 263
CVE-2022-29181 High nokogiri (gem) 48 260
Table 9: The table reports the top 10 vulnerabilities per number of potentially
vulnerable repositories. The table shows: the severity of the vulnerability, the
vulnerable package, the number of the (package name, package version) pairs affected
and the number of potentially vulnerable repositories.
Table 9 shows the top 10 vulnerabilities per number of potentially vulnerable
repositories. The repositories considered are those of all 4 datasets IT, DE,
UK and US.
Each vulnerability is associated with several versions of a single package. It
can be seen that most of the vulnerabilities have high severity and are related
to JavaScript or Ruby packages. Moreover, each of the top 5 vulnerabilities in
the table makes more than 300 repositories potentially vulnerable. In addition,
the following data are reported:
• Each of the top 80 vulnerabilities by number of potentially vulnerable
repositories makes more than 100 repositories potentially vulnerable.
Moreover, all of them come from JavaScript or Ruby packages.
• Together, vulnerabilities CVE-2022-3517, CVE-2021-43809 and
GHSA-2qc6-mcvw-92cw (first, second and fourth position in Table 9
respectively) make 880 out of 4 551 repositories (almost 20%) potentially
vulnerable.
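The combined figure is a union and not a sum: the three vulnerabilities affect 443, 392 and 351 repositories respectively (Table 9), but a repository affected by more than one of them is counted once. A minimal sketch with placeholder data and a hypothetical function name:

```python
def combined_share(vuln_repos, vuln_ids, total_repos):
    """Number and percentage of repositories affected by at least one
    of the given vulnerabilities."""
    affected = set()
    for vuln_id in vuln_ids:
        affected |= vuln_repos.get(vuln_id, set())
    return len(affected), 100.0 * len(affected) / total_repos


# Placeholder data: "r2" is affected by both vulnerabilities.
vuln_repos = {"VULN-A": {"r1", "r2"}, "VULN-B": {"r2", "r3"}}
print(combined_share(vuln_repos, ["VULN-A", "VULN-B"], total_repos=10))

# Share reported in the text: 880 out of 4 551 repositories.
print(round(100.0 * 880 / 4551, 2))
```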
Figure 24 shows the percentages of pairs with a number of high and critical
vulnerabilities in the indicated ranges. The percentages are calculated over
the number of pairs with at least one vulnerability of severity high or higher.
It can be seen that almost 50% of the vulnerable pairs have only one
vulnerability with high severity, and just over 16% have only one
vulnerability with critical severity. Moreover, only 6.5% of the vulnerable
pairs taken into account have more than 1 critical vulnerability, while 28.8%
have more than 1 high vulnerability.
The results in Figure 24 change slightly when only unresolved vulnerabili-
ties are taken into account: 67.44% of packages with at least one unresolved
high or critical vulnerability have exactly one unresolved critical vulnerability.
Among the others, 14.05% have exactly one high unresolved vulnerability.
Figure 24: Heat maps of the number of high and critical vulnerabilities. The value
in each square is the percentage of (package name, package version) pairs that have
a number of high and critical vulnerabilities in those ranges. Percentage values are
computed only over the (package name, package version) pairs with at least one high
or critical vulnerability (see Table 8).
5 Discussion and conclusions
In this work, an attempt was made to use an SBoM building tool to
systematically create SBoMs for a set of GitHub repositories. These SBoMs were
then used to obtain information on supply chain dependencies and to perform
vulnerability analyses.
A special focus was put on the data used: 4 datasets containing repositories
of the Italian, German, British and American public administration were
constructed.
A rigorous process was defined in order to obtain the dependencies and
vulnerabilities associated with a repository and then to investigate the
differences between the 4 datasets.
Data of the 4 datasets come from a set of GitHub organizations related to
the public administration of different countries of the world. Among these
organizations, it was seen that the United States and the United Kingdom
have the highest number of organizations and repositories. It was also noted
that among the 16 countries with the largest number of organizations, the most
popular languages are Python and JavaScript.
For the US dataset, the most popular type of repository is the one that does
not contain code. For the other datasets, IT, DE and UK, the most popular
repositories are Python, JavaScript and Ruby ones respectively.
During the process of creating the SBoMs, some critical issues emerged: the
tool for generating SBoMs does not allow for full management of conditional
dependencies, behaving differently depending on the ecosystem. In addition,
non-stringent version constraints of a package are solved by using the latest
version released at the time of SBoM creation, without defining a standardised
procedure to specify the date of last update of the software artifact.
Moreover, a problem in obtaining Java vulnerabilities for packages was en-
countered, due to a lack of full compatibility between the SBoM creation tool
and the tool used for vulnerability analysis.
All the dependencies found for the four datasets were then analysed; it was
seen that, on average, the DE dataset has the highest number of dependencies
per repository, while the US dataset has the fewest. It was noted that for all
datasets, the largest number of dependencies came from the JavaScript pack-
age manager npm.
The 30 packages for which the highest number of repositories depend were
taken into account, and it was seen that most of these are JavaScript packages
for all the 4 datasets. It was also observed that many JavaScript dependencies
come from non-JavaScript repositories.
The vulnerabilities collected were then analysed. It was seen that the UK
dataset, with almost 40%, is the one with the highest percentage of
repositories associated with at least one high or critical vulnerability.
The IT dataset is the one with the highest percentage of repositories associ-
ated with a large number of high and critical vulnerabilities. In fact, 22.68%
of the repositories with at least one high or critical vulnerability have more
than 44 high vulnerabilities and more than 11 critical vulnerabilities.
Among all the vulnerabilities, the focus was then put on the subset of
unresolved ones: an unresolved vulnerability v is defined as a vulnerability
for which no version of the vulnerable package without v has been released.
Although only 14.46% of all the vulnerabilities found are unresolved, the
percentages of repositories with at least one unresolved critical or high
vulnerability are only slightly lower than the percentages of repositories
with at least one critical or high vulnerability.
Taking into account vulnerabilities from the viewpoint of the packages, it
was seen that the package manager with the highest percentage of vulnerable
packages is the Ruby one. In fact, 11.44% of the (package name, package
version) pairs from RubyGems have at least one vulnerability (see Table 7).
Furthermore, although only 4.90% of the (package name, package version) pairs
were found to be vulnerable, it is observed that vulnerabilities in only 3
packages make almost 20% of all repositories potentially vulnerable.
The conclusion is drawn by observing how the use of SBoM standards can
be an effective way of systematically keeping track of the elements of a soft-
ware artifact’s supply chain; the presence of rigorous standards provides the
possibility of being able to analyse the dependencies and their vulnerabilities
in a programmatic and continuous way.
However, the SBoM creation tool taken into consideration still presents
critical issues, and there is not complete compatibility between the creation
and analysis tools used. In detail, it is not yet possible to define a
procedure for the creation and analysis of an SBoM that applies to a GitHub
repository and that is:
• Accurate: packages with non-fixed versions are inserted into the SBoM
with the latest version existing at the time the SBoM is built, not at the
time of the latest project update. Moreover, conditions of conditional
dependencies are ignored.
• Consistent: across different ecosystems. Conditional dependencies are
handled with a predefined behaviour different for each ecosystem.
• Complete: starting from the SBoM, vulnerabilities associated with
Maven packages are not found.
Finally, in the next paragraphs, open problems and possible future works are
discussed.
A limitation of a vulnerability analysis such as the one carried out in this
work is that the vulnerabilities found are potential vulnerabilities for reposi-
tories. A vulnerability in a package used by one of the analysed repositories
is a potential vulnerability for that repository. A potential vulnerability ac-
tually becomes a vulnerability when it is exploitable in the target repository;
whether the vulnerability is exploitable depends on the parts of the package
used and how they are used by the repository itself. Figuring out whether
or not a third-party vulnerability affects a repository is an open question for
defining the actual impact of that vulnerability.
Furthermore, it is important to understand whether the maintainers of a
repository are aware of potential vulnerabilities in their repository. GitHub
provides the Dependabot [38] tool, which generates alerts when it identifies a
project dependency with a vulnerability. To the best of the author’s
knowledge, there appear to be no studies on the behaviour of maintainers, i.e.
whether they analyse the presence of vulnerabilities and their potential
impact or not. In
addition, there seems to be no convention on how to indicate if a vulnerability
of a certain third-party dependency has been analysed or not. Understanding
whether the repository is actually vulnerable and providing this information
in a standardized way would be useful for developers that decide to use the
repository. Moreover, having this information would make vulnerability anal-
ysis more precise, allowing analysis to be limited to only those vulnerabilities
that can be exploited from the code contained in the repository.
An analysis that could be the subject of future studies concerns the life cycle
of a vulnerability with respect to the repository. It may be important to
understand whether a potential vulnerability was already known at the time of
the commit that introduced it. Such an analysis would provide a better understanding
of the behaviour of maintainers: do they take potential vulnerabilities into
account when using third-party packages in their open source projects?
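The comparison at the heart of such an analysis can be sketched as follows. This is only an illustration under stated assumptions: the function name and the dates are invented, and where the two timestamps would come from (for instance commit metadata and the publication date of an advisory) is left open.

```python
from datetime import datetime, timezone


def known_at_commit(commit_time, published_time):
    """Return True if the vulnerability was already public when the commit was made."""
    return published_time <= commit_time


# Invented timestamps, used only to illustrate the two possible outcomes.
commit_time = datetime(2022, 3, 1, tzinfo=timezone.utc)
published_before = datetime(2021, 12, 10, tzinfo=timezone.utc)
published_after = datetime(2022, 6, 15, tzinfo=timezone.utc)

already_known = known_at_commit(commit_time, published_before)
disclosed_later = known_at_commit(commit_time, published_after)
```

A dependency whose vulnerability was published before the commit suggests that the information was available to the maintainers; one published afterwards could not have been taken into account at commit time.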
Another direction for future work concerns the differences in the number of
vulnerabilities across repositories. It might be interesting to understand
whether these differences are due to the number of third-party packages used,
the type of these packages or specific actions of the maintainers. Repositories
that are less vulnerable than others, for instance, could indicate that their
maintainers analyse potential vulnerabilities and remove vulnerable dependencies.

6 Appendix
This appendix describes the software developed for data collection and
database construction. Each functional unit is described in its own
subsection.
The data collected by the functional units are then stored in an SQLite database
with the structure described in subsection 4.2. To do this, Python’s sqlite3
module is used. Whenever a record is inserted into a table and a record with
the same primary key already exists, the insertion is ignored.
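The insertion code itself is not shown in this appendix. A minimal sketch of one way to obtain this behaviour with sqlite3 is SQLite's INSERT OR IGNORE statement; the table name, the column names and the helper function below are hypothetical and do not reproduce the schema of subsection 4.2.

```python
import sqlite3

# Hypothetical table and column names, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE organization ("
    "username TEXT PRIMARY KEY, section TEXT, category TEXT)"
)


def insert_organization(conn, username, section, category):
    """Insert one record; skip it silently if the primary key already exists."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO organization VALUES (?, ?, ?)",
            (username, section, category),
        )


insert_organization(conn, "example-org", "Governments", "Example category")
# Same primary key: this insertion is ignored and the first row is kept.
insert_organization(conn, "example-org", "Governments", "Another category")

rows = conn.execute("SELECT * FROM organization").fetchall()
```

With this approach a collection script can be re-run without raising an integrity error on records that are already present, at the cost of never updating an existing row.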
Note that the functional units described in this appendix do not cover all of
the code written. In particular, the scripts that store data in the database
and the scripts that invoke the functional units have been omitted. The scripts
that collect vulnerability metadata have also been omitted, since the data
obtained from them are not discussed in this work.

6.1 GitHub & Government list
Type: Script.
Language: Python.
Tools and external libraries used: BeautifulSoup [39], requests [40].
Input: GitHub & Government list web page URL.
Output: Python list of (username, section, category) tuples, one for each
organization in the GitHub & Government list.
Description: The Python script uses the BeautifulSoup library to scrape the
GitHub & Government page and obtain the data described in subsection 3.1.
The web page is retrieved with the requests library; the username and the
subsection name are extracted from the HTML content with two regular
expressions.
The core of the code is shown below:
import re

import requests
from bs4 import BeautifulSoup

URL = 'https://government.github.com/community/'  # GitHub and Government URL
resp = requests.get(URL)  # Get the web page
soup = BeautifulSoup(resp.text, 'lxml')  # Parse the HTML (requires the lxml parser)
orgs_names = soup.select('div.org-name')
orgs = []
for on in orgs_names:
    # r'@([\w-]+)': regex for the GitHub username that follows '@'
    username = re.search(r'@([\w-]+)', on.text).group(1)
    section = on.find_previous('h2').text.strip()
    subsec = on.find_previous('h3').text
    # Regex for the subsection name: the text before the first number in the
    # heading (lazy match, so no digits are left in the name; this assumes
    # the heading carries a count after the name)
    subsec = re.search(r'(.*?)\s*\(?[0-9]+', subsec).group(1).strip()
    orgs.append((username, section, subsec))
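The two extraction steps can also be tried offline, without fetching the page. The sketch below is only an illustration: the sample strings are invented, the exact text of the real page is an assumption, and the patterns are one possible formulation rather than necessarily the ones used in the thesis. The helper names are hypothetical.

```python
import re

# '@' followed by letters, digits, underscores or hyphens.
USERNAME_RE = re.compile(r'@([\w-]+)')
# Lazy name group, then a count that may be wrapped in parentheses.
SUBSECTION_RE = re.compile(r'(.*?)\s*\(?[0-9]+')


def parse_username(text):
    """Return the handle that follows '@', or None if there is none."""
    match = USERNAME_RE.search(text)
    return match.group(1) if match else None


def parse_subsection(heading):
    """Return the heading text before its count, or the stripped heading if no count is found."""
    match = SUBSECTION_RE.search(heading)
    return match.group(1).strip() if match else heading.strip()


# Invented sample inputs.
username = parse_username("  @example-org ")
name_with_parens = parse_subsection("Example category (12)")
name_plain_count = parse_subsection("Example category 7")
name_without_count = parse_subsection("No count")
```

Returning None or the unmodified heading when a pattern does not match avoids the AttributeError that calling a method on a failed search would raise, which can be useful if the page layout changes.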