Software bill of materials: tools and analysis of open source projects of the public administration

  1. 1. University of Trieste, Department of Engineering and Architecture, Computer & Electronic Engineering. Master’s Thesis. Software bill of materials: tools and analysis of open source projects of the public administration. Candidate: Federico Boni. Supervisor: Prof. Alberto Bartoli. Academic Year 2021–2022
  2. 2. Abstract During the life cycle of a piece of software, many elements are involved in the processes of building and distributing it, for example development tools or third-party libraries. Together, these elements define the software supply chain. When building software, modularity and code reuse are effective practices for avoiding problems that have already been solved. These practices lead to the use of external components called dependencies, often open source (software libraries). However, these components are subject to vulnerabilities, and the presence of these vulnerabilities translates into potential risks for the software itself. Monitoring the supply chain is therefore essential for organizations involved in software development. Recognition of the importance of this problem led to the emergence of SBoMs. A Software Bill of Materials, or SBoM, is a formal document for keeping track of each of the components used within a software artifact, i.e. the elements of the supply chain. This study uses existing technologies to collect dependencies and create SBoM files for a number of software artifacts. The supply chain components considered are libraries from different programming language ecosystems. The creation of the SBoMs is followed by a vulnerability analysis, to identify potential vulnerabilities of the software caused by elements of the supply chain. The software artifacts considered are open source projects of government agencies of 4 countries: 4 datasets are defined, containing GitHub repositories from Italy, Germany, the United Kingdom and the United States. The results reveal some differences between the datasets, both in terms of dependencies on software packages and in terms of vulnerabilities.
Many repositories turn out to depend on third-party packages; moreover, a small number of packages account for a high number of vulnerabilities. In addition, some vulnerable packages are widely used among the repositories of the datasets, making those repositories potentially vulnerable. During the process of creating and analysing the SBoMs, some critical issues also emerge in the handling of certain dependencies and in the compatibility between the creation and analysis tools used. Finally, after analysing the results obtained and the limitations of the tools used, it is observed that SBoM standards can be an effective way to keep track of the components of a piece of software and to systematically analyse the presence of vulnerabilities. However, it is not yet possible to define a procedure for creating and analysing SBoMs for a GitHub repository that is accurate in collecting dependencies, consistent in handling different ecosystems, and complete in collecting vulnerabilities.
  3. 3. Contents
     1 Introduction
     2 Supply-Chain Security in Open Source Software
       2.1 Introduction to the problem
         2.1.1 Software ecosystems, dependencies and packages
         2.1.2 Software supply chain
         2.1.3 Open source and supply chain
       2.2 SBoM
         2.2.1 Software Bill Of Materials
         2.2.2 Standards
         2.2.3 Tools
       2.3 Review of the Literature
         2.3.1 Top Five Challenges in Software Supply Chain Security: Observations From 30 Industry and Government Organizations
         2.3.2 An Empirical Comparison of Dependency Network Evolution in Seven Software Packaging Ecosystems
         2.3.3 Structure and Evolution of Package Dependency Networks
     3 GitHub data collection
       3.1 GitHub and Government
         3.1.1 Organizations characteristics per country
         3.1.2 Repositories characteristics per country
       3.2 The case study: 4 datasets for Italy, Germany, the United States and the United Kingdom
     4 Methods and Analyses
       4.1 SBoM creation and vulnerability detection
       4.2 Database
       4.3 Creation of SBoMs: critical issues
         4.3.1 Conditional dependencies
         4.3.2 Version constraints
       4.4 Vulnerabilities collection: critical issues
         4.4.1 Grype and Java vulnerabilities
         4.4.2 Grype false positive
       4.5 Dependency analyses
         4.5.1 Packages and dependencies: Viewpoints of package managers, datasets and repositories
         4.5.2 Critical dependencies
         4.5.3 Manifest vs Parsed dependencies
       4.6 Vulnerability analyses
         4.6.1 Viewpoint of repositories
         4.6.2 Viewpoint of packages
     5 Discussion and conclusions
  4. 4. 6 Appendix
       6.1 GitHub & Government list
       6.2 Database connector
       6.3 Database
       6.4 Organization data (GitHub APIs)
       6.5 Repository data (GitHub APIs)
       6.6 Contributor data (GitHub APIs)
       6.7 SBoM (sbom-tool)
       6.8 Parsed dependencies (Python)
       6.9 Parsed dependencies (JavaScript)
       6.10 Vulnerabilities data (Grype)
       6.11 Vulnerabilities data (OSV API)
       6.12 Figures (Bar charts)
       6.13 Figures (Line charts)
       6.14 Figures (Venn diagrams)
       6.15 Figures (Pie charts)
       6.16 Figures (Heat maps)
  5. 5. 1 Introduction The supply chain of a piece of software is the set of elements of any type that play a role in the life cycle of a given software artifact. Cybersecurity attacks related to a software supply chain can be either attacks on third-party supply chain elements (known as supply chain attacks) or attacks that exploit vulnerabilities in third-party components. During software development, modularity and code reuse are effective practices to avoid dealing with problems that have already been solved. Nowadays software is not created from scratch, but is a combination of various external components called dependencies, often open source (software libraries). However, these components are subject to vulnerabilities, and the developers who use them do not have full control over their code. It is important to note that the presence of vulnerabilities in the supply chain translates into potential risks for the software itself. Monitoring the elements of the supply chain is therefore essential for organizations involved in software development. This work considers the concept of SBoM (Software Bill of Materials), a formal record containing the list of components of a piece of software, i.e. the elements of the software’s supply chain. The concept of SBoM became widely known when, on 12 May 2021, the President of the United States signed an Executive Order [1] with standards and best practices concerning the cybersecurity of the United States. That Executive Order recommended the use of SBoMs to track the supply chain relationships of software applications. SBoM standards are specifically designed to be machine readable. This makes it possible to track dependencies programmatically and continuously, starting from a SBoM.
Moreover, the strength of a machine-readable SBoM format lies in the possibility of defining formal procedures for analysing it: software vendors can use SBoMs as input for vulnerability analysis software capable of matching each third-party component described in the SBoM with existing vulnerabilities. Here, a special focus is given to the dependencies that come from various programming language ecosystems. These dependencies are third-party libraries, modules or packages that can be installed using different package managers for each language. Potential cybersecurity issues caused by vulnerabilities of these dependencies are also taken into account. In detail, this work attempts to build and analyse SBoMs starting from a set of GitHub repositories belonging to the public administration of different countries. The countries considered are Italy, Germany, the United Kingdom and the United States. This work reports the results obtained in terms of dependency and vulnerability analysis of the GitHub repositories under consideration, as well as problems and critical issues related to the tools used.
  6. 6. 2 Supply-Chain Security in Open Source Software 2.1 Introduction to the problem 2.1.1 Software ecosystems, dependencies and packages The work in [2] defines a software ecosystem as: “A collection of software projects, which are developed and co-evolve in the same environment”. An environment can be defined by the components of an organization, by a community (e.g. open source communities) or by a programming language or framework. In this work a software ecosystem is intended as an ecosystem defined by a programming language and its package managers. Given a software artifact, a dependency of it is any other artifact (libraries, plugins, ...) that the software artifact requires in order to work as expected. These dependencies may either be parts of the software itself (i.e. developed by the same programmers) or be third-party components. In the latter case, a distinction is made between proprietary dependencies and open source dependencies. The set of dependencies of a software artifact defines the so-called dependency tree. Dependencies are usually divided into direct and transitive dependencies:
     • Direct dependency: a dependency d of a software artifact x is a direct dependency when x refers to d directly.
     • Transitive dependency: a dependency d1 of a software artifact x is a transitive dependency when d2 is a dependency of x and d1 is a dependency of d2.
     When developers want to distribute their code (e.g. a library) in a certain ecosystem, they can distribute a so-called package using a target package manager. A package can contain a stand-alone tool as well as a library or a set of libraries. Programming languages use package managers to resolve the dependencies (packages) defined in the software dependency tree during the build of a software artifact in its ecosystem.
How dependency information is stored and how a dependency is resolved depend strictly on the software ecosystem, where dependency resolution can occur at compile time as well as at runtime. The way the authors of a software artifact decide which dependencies to use and how to manage them may affect the reliability of the software, which may be compromised because of its dependencies. For example, if a developer uses a package that is no longer present in a given package manager registry, the software will no longer be installable using that package manager. This is the case of the npm package left-pad [3], a package composed of a single function of 11 lines. When its owner, a self-taught high school graduate programmer from Turkey, removed it from the npm registry in March 2016, many packages that depended on it could no longer be installed. Among them, the most famous were Babel, Atom and React.
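The distinction between direct and transitive dependencies given above can be illustrated with a minimal Python sketch. It assumes the dependency information is already available as a dictionary mapping each artifact to the names it refers to directly; the graph and the package names below are hypothetical and are not taken from the thesis datasets.

```python
from collections import deque


def direct_dependencies(graph, artifact):
    """Return the dependencies that `artifact` refers to directly."""
    return set(graph.get(artifact, ()))


def transitive_dependencies(graph, artifact):
    """Return every dependency reached through at least one intermediate dependency.

    A package can be both direct and transitive, so direct ones are not removed.
    """
    reached = set()
    queue = deque(direct_dependencies(graph, artifact))
    while queue:
        current = queue.popleft()
        for dep in graph.get(current, ()):
            # Skip already visited nodes and guard against circular dependencies.
            if dep not in reached and dep != artifact:
                reached.add(dep)
                queue.append(dep)
    return reached


# Hypothetical dependency graph, for illustration only.
example_graph = {
    "app": ["framework", "logger"],
    "framework": ["parser", "logger"],
    "parser": ["string-utils"],
}

direct = direct_dependencies(example_graph, "app")
transitive = transitive_dependencies(example_graph, "app")
```

In this toy graph `logger` is both a direct dependency of `app` and a transitive one (through `framework`), which is why the two sets are allowed to overlap.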
  7. 7. 2.1.2 Software supply chain The software supply chain is the set of elements of any type that play a role in the software development life cycle (SDLC) of a given software artifact. The dependencies of a software artifact are certainly part of its supply chain, and software artifacts have more and more dependencies: the authors of [4] report that, up to 2017, the total number of dependencies of the packages in the package managers they considered (cargo, cpan, cran, npm, nuget, packagist and rubygems) kept growing over time. Software developers rely on third-party dependencies (open source or proprietary) simply to avoid reinventing the wheel, which speeds up the development process. However, this massive use of third-party dependencies increases the number of elements in software supply chains. The growth of dependencies in a software artifact leads to the so-called dependency hell, a colloquial term for various problems related to the increase of dependencies in a project, including the size and number of dependencies, the presence of conflicts and the presence of circular dependencies. According to the 2022 Verizon Data Breach Investigations Report (DBIR) [5], supply chain attacks increased in the previous year, accounting for 9% of the report’s total incident corpus. 2.1.3 Open source and supply chain Open source software artifacts have the benefit of being surrounded by strong communities that aim to improve them. In addition, using an open source software artifact has no licence cost and its code is freely available to anyone. However, the downside is that having the source code constantly available is a great advantage for an attacker and, as highlighted by [6], even new vulnerability fixes can help the attacker find similar vulnerabilities in other software.
Another problem is that adding an artifact to a supply chain means completely trusting its author, and an open source artifact can be managed either by a large, important organization or by a single person (e.g. the left-pad package). In the 2021 State of the Software Supply Chain Report [7], Sonatype reports a 650% increase in detected supply chain attacks aimed at exploiting weaknesses in upstream open source ecosystems. The year before, Sonatype registered a 430% increase in these supply chain attacks. The threat model of these attacks considers an attacker who injects malware directly into open source projects to infiltrate the supply chain. As per the Sonatype report, these software supply chain attacks are insidious because attackers do not wait for public vulnerability disclosures to exploit:
  8. 8. instead, they inject new vulnerabilities into open source projects which are part of the supply chain. In this work, only dependencies such as libraries and open source packages used within software artifacts will be considered as elements of the supply chain. In detail, the ecosystems considered mainly concern open source dependencies of specific languages and specific package managers, such as npm for JavaScript or Maven for Java. However, it should be emphasized that the components of the software supply chain are not limited to those taken into consideration in this work. For example, a software artifact published in a production environment with the help of a package manager will have that package manager in its supply chain, as well as all the other software used in the publication process. Regarding the software artifacts taken into account, only open source projects of public administrations are considered in this work. 2.2 SBoM 2.2.1 Software Bill Of Materials As noted in the previous sections, dependency management and supply chain attacks are critical issues in managing a software artifact. On May 12, 2021, the President of the United States signed an Executive Order [1] with standards and best practices concerning the Nation’s Cybersecurity. In the “Enhancing Software Supply Chain Security” section, the use of SBoMs is recommended to keep track of the supply chain relationships of software applications. The SBoM concept was born as a collaborative community project driven by the National Telecommunications and Information Administration [8] in 2018. The Executive Order gave a brief description of what a SBoM is: The term “Software Bill of Materials” or “SBoM” means a formal record containing the details and supply chain relationships of various components used in building software.
A machine-readable SBoM format allows a software vendor to track each dependency (libraries, other software or components) of its software artifact and to make sure they are up to date in a programmatic way. As per the Executive Order, buyers can use a SBoM to perform vulnerability analyses to evaluate the risk of a product, and vendors can determine whether their software is at potential risk from a newly discovered vulnerability. A machine-readable SBoM can be used with automation and integration tools to exploit its potential for understanding the supply chain of a piece of software.
  9. 9. 2.2.2 Standards Several SBoM standards have been developed to provide a unified approach for generating SBoMs and sharing them. A SBoM standard provides a schema that describes the structure and dependencies of a software artifact in a way that is consumable by other tools for monitoring and managing the supply chain (e.g. software for vulnerability analysis). Among the various standards, the most common are:
     • CycloneDX [9], a lightweight SBoM standard designed for use in application security contexts and supply chain component analysis. It allows SBoMs to be created in XML or JSON formats; it is an open source project maintained by the OWASP Foundation.
     • SPDX (Software Package Data Exchange) [10], an open standard hosted by the Linux Foundation for communicating software bill of materials information. It allows SBoMs to be created in YAML, JSON, XLS and RDF formats.
     2.2.3 Tools There are several tools for creating SBoMs. These tools can take as input a Docker image, other SBoMs or a source folder. They automatically search for components used within a project, such as libraries and packages used within a certain ecosystem, by parsing the files that contain the list of dependencies for that ecosystem. Two examples of software for creating SBoMs are Syft [11] from Anchore and sbom-tool [12] from Microsoft. Both are publicly available on GitHub. While the former can generate SBoMs in CycloneDX, SPDX and Syft’s own format, the latter can generate SBoMs only in SPDX 2.2 format. The CycloneDX project has a collection of tools in the Tool Center section of its website [9] that can be used to build SBoMs in CycloneDX format. In this work sbom-tool by Microsoft is used. Sbom-tool builds SBoMs in SPDX 2.2 format, taking as input the source folder of a project.
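To make the idea of a machine-readable SBoM more concrete, the following Python sketch lists the components recorded in an SPDX 2.2 document serialized as JSON. It is only an illustration under stated assumptions: the sample document is a heavily trimmed, hand-written example (not output of sbom-tool), and the helper name `list_components` is hypothetical. The field names `packages`, `name`, `versionInfo`, `externalRefs`, `referenceType` and `referenceLocator` are those of the SPDX JSON serialization.

```python
import json

# Hand-written, trimmed SPDX 2.2 JSON sample for illustration only.
SAMPLE_SPDX = """
{
  "spdxVersion": "SPDX-2.2",
  "name": "example-project",
  "packages": [
    {
      "name": "left-pad",
      "versionInfo": "1.3.0",
      "externalRefs": [
        {"referenceType": "purl", "referenceLocator": "pkg:npm/left-pad@1.3.0"}
      ]
    },
    {"name": "requests", "versionInfo": "2.28.1"}
  ]
}
"""


def list_components(spdx_document):
    """Return (name, version, purl) for every package listed in an SPDX JSON document."""
    components = []
    for package in spdx_document.get("packages", []):
        purl = None
        for ref in package.get("externalRefs", []):
            # A package URL (purl), when present, identifies ecosystem, name and version.
            if ref.get("referenceType") == "purl":
                purl = ref.get("referenceLocator")
                break
        components.append((package.get("name"), package.get("versionInfo"), purl))
    return components


components = list_components(json.loads(SAMPLE_SPDX))
```

A list of this kind is the natural input for a vulnerability scanner, which can then match each (ecosystem, name, version) triple against a vulnerability database.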
2.3 Review of the Literature 2.3.1 Top Five Challenges in Software Supply Chain Security: Observations From 30 Industry and Government Organizations In the work in [4] the authors conducted three software supply chain security summits (two industry summits and one government summit). A total of 30 organizations from different sectors, all from the United States, attended these 3 summits. The paper presents the five most important challenges in supply chain security that were identified during these summits. Challenge 1 - Updating of vulnerable dependencies. It has been noted how
  10. 10. a quick update to the latest version of a vulnerable dependency can introduce malicious code. Among the participants of the summits, there was advice such as: never be the first or last to update a dependency; adopt continuous integration/continuous deployment (CI/CD) policies to prevent the inclusion of vulnerable dependencies; maintain a “zero trust” policy for dependencies. Challenge 2 - Leveraging the SBoMs for Security. The US President’s executive order brought the old concept of SBoMs to the forefront. Some participants found the sharing of SBoMs harmful: the use of a dependency is not atomic, and developers often pull in only specific parts. So why not simply request accurate and timely vulnerability information? Other participants felt that SBoMs provide a way towards a zero-trust approach for the supply chain and that they have the potential to lay the foundations for innovative security improvements. Challenge 3 - Choosing Trusted Supply Chain. A crucial point is whether to trust the maintainers of a library, an organization, or the integrity of the build environment. Package maintainers are looking for ways to automatically identify malicious packages, for instance by using only the metadata of a package. However, all techniques used so far present technical challenges. Challenge 4 - Securing the Build Process. The recent use of CI/CD tools opens up the possibility of new attacks that inject malicious code during the build process. Participants were largely positive about the use of Supply Chain Layers for Software Artifacts (SLSA, a framework that provides a checklist of standards to be met during the build process). However, there is still a lack of knowledge about what the risks are. Challenge 5 - Getting Industry-Wide Participation. The big tech companies have been working for a long time to solve supply chain security problems, with great (and manual) efforts that only help their own company.
However, some of the major players are already coming together. Some noteworthy guidelines and methodologies are the Building Security in Maturity Model (BSIMM) and the Open Web Application Security Project (OWASP). 2.3.2 An Empirical Comparison of Dependency Network Evolution in Seven Software Packaging Ecosystems The work in [13] analyses the evolution over time of package dependency networks for seven packaging ecosystems (defined by seven package managers): cargo (Rust), cpan (Perl), cran (R), npm (JavaScript), nuget (.NET), packagist (PHP) and rubygems (Ruby), using the libraries.io dataset (LINK). The temporal data covers the period from the birth of each ecosystem until the end of 2016. The authors analysed the seven ecosystems in order to answer 4 research questions:
     • How do package dependency networks grow over time?
     • How frequently are packages updated?
  11. 11. • To which extent do packages depend on other packages?
     • How prevalent are transitive dependencies?
     The results obtained are summarized below:
     • All the ecosystems taken into account seem to grow over time (in terms of number of packages). The ones that seem to grow fastest are cran and npm, which do so exponentially. The ratio of dependencies over packages remains stable for cpan, packagist and rubygems, while it increases for all the others.
     • In all ecosystems the number of package updates is stable or growing over time. The majority of updates:
       – Come from only a small set of active packages.
       – Involve packages that are no older than 12 months.
     • A majority of dependent packages depend on a minority of required packages. Among the latter, only a small subset produces a high proportion of reverse dependencies.
     • In all ecosystems, more than half of the top-level packages (packages that are not dependencies) have a dependency tree of depth greater than or equal to three.
     2.3.3 Structure and Evolution of Package Dependency Networks Similarly to the previous work, the work in [14] focuses on analysing the structure and evolution of networks of dependencies for different ecosystems. The ecosystems considered in this work are related to JavaScript, Ruby and Rust. Package data have been obtained from the central repositories for JavaScript and Ruby (npm and RubyGems, respectively) and from GitHub for Rust. The analyses also considered end-user applications from GitHub for all three ecosystems. The authors organized the analyses around the following 3 research questions:
     • What are the static characteristics of package dependency networks?
     • How do package dependency networks evolve?
     • How resilient are package dependency networks to a removal of a random project?
The results obtained are summarized below:
     • The number of transitive dependents for JavaScript is almost two times the number of transitive dependents for other languages. In addition, the dependency networks (represented as directed graphs) of all the ecosystems present a giant weakly connected component (composed of 96.14%, 98.2% and 100% of projects for Rust, JavaScript and Ruby, respectively).
  12. 12. • The total number of dependencies for each project release grows faster for JavaScript projects than for the others, with an average size of total dependencies that registered more than 60% yearly growth between 2015 and 2016. Looking at the variation of the numbers of direct and transitive dependencies between 2005 and 2017, it can be noted that JavaScript projects have more transitive dependencies than the others, but fewer direct dependencies. The authors suppose that the high number of transitive JavaScript dependencies is due to the possibility of having multiple versions of the same package in a given project.
     • Every ecosystem studied has at least one package whose removal could impact 30% of the other packages (and applications). Among the packages with the most dependents, those of JavaScript are utility packages, those of Ruby are packages related to web servers and those of Rust are interfaces to system-level types and libraries.
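The resilience result summarized above (a single package whose removal could impact a large share of the ecosystem) can be illustrated with a small Python sketch. It is not the method of [14]: it simply assumes a dependency graph given as a dictionary from each package to its direct dependencies, and measures the fraction of the other packages that depend, directly or transitively, on the removed one. The graph and package names are hypothetical.

```python
from collections import defaultdict, deque


def impacted_by_removal(graph, removed):
    """Return all packages that depend, directly or transitively, on `removed`."""
    # Build the reverse graph: dependency -> packages that require it.
    reverse = defaultdict(set)
    for package, deps in graph.items():
        for dep in deps:
            reverse[dep].add(package)

    impacted = set()
    queue = deque([removed])
    while queue:
        current = queue.popleft()
        for dependent in reverse.get(current, ()):
            if dependent not in impacted and dependent != removed:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted


def impact_fraction(graph, removed):
    """Fraction of the other packages affected by removing `removed`."""
    all_packages = set(graph) | {dep for deps in graph.values() for dep in deps}
    others = all_packages - {removed}
    if not others:
        return 0.0
    return len(impacted_by_removal(graph, removed)) / len(others)


# Hypothetical ecosystem, for illustration only.
ecosystem = {
    "a": ["core"],
    "b": ["a"],
    "c": ["core"],
    "d": [],
    "e": ["d"],
    "core": [],
}

impacted = impacted_by_removal(ecosystem, "core")
fraction = impact_fraction(ecosystem, "core")
```

In this toy ecosystem, removing `core` affects `a` and `c` directly and `b` through `a`, i.e. three of the five other packages.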
  13. 13. 3 GitHub data collection 3.1 GitHub and Government In this work, open source organizations of the public administration of different countries are considered. To do this, the GitHub page “GitHub and Government” [15] is used. The GitHub and Government page contains a list of the GitHub organizations registered as government agencies at the national, state and local level for different countries. This GitHub list is subdivided into three sections:
     • Governments (includes 887 organizations)
     • Civic Hackers (includes 305 organizations)
     • Government-funded Research (includes 123 organizations)
     The sections “Governments” and “Government-funded Research” are subdivided into subsections, each associated with a specific country (except for the “European Union” and “United Nations” subsections, which have 1 and 10 organizations respectively). The section “Civic Hackers” is subdivided into a few subsections with no clear association with any country:
     • Civic Hackers (includes 159 organizations)
     • Code for All (includes 18 organizations)
     • Code for America (includes 108 organizations)
     • Open Knowledge Foundation (includes 20 organizations)
     The list of organizations was obtained on October 11th, 2022. After obtaining the list, a country is assigned to each organization org as follows:
     • If org is listed in a subsection associated with a country c, org is associated with c.
     • If the metadata of org (obtained with the repos API of GitHub [18]) include geographical information, then org is associated with the country c resulting from those metadata.
     • If org is listed in the “Civic Hackers” section and “Code for America” subsection, then org is associated with country c = “United States”.
     • Otherwise, org is not associated with any country.
     Based on this procedure, 73 countries with at least one organization are found, together with 102 organizations not associated with any country (these include 10 organizations listed as “United Nations” and 1 listed as “European Union”).
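The country-assignment rules above can be read as a simple cascade, sketched here in Python. This is an illustration rather than the code used in the thesis: the record layout and the field names (`subsection_country`, `metadata_country`, `section`, `subsection`) are hypothetical, and the sketch assumes that any geographical metadata has already been resolved to a country name.

```python
def assign_country(org):
    """Assign a country to an organization record, applying the rules in order.

    `org` is a dict with hypothetical keys:
      - "subsection_country": country of the list subsection, or None
      - "metadata_country": country derived from GitHub metadata, or None
      - "section" / "subsection": where the organization appears in the list
    """
    # Rule 1: the subsection of the list is associated with a country.
    if org.get("subsection_country"):
        return org["subsection_country"]
    # Rule 2: the organization metadata carry geographical information.
    if org.get("metadata_country"):
        return org["metadata_country"]
    # Rule 3: "Code for America" organizations are assigned to the United States.
    if org.get("section") == "Civic Hackers" and org.get("subsection") == "Code for America":
        return "United States"
    # Rule 4: no country can be assigned.
    return None


# Hypothetical records, for illustration only.
organizations = [
    {"name": "org-1", "section": "Governments", "subsection": "Italy",
     "subsection_country": "Italy"},
    {"name": "org-2", "section": "Civic Hackers", "subsection": "Code for All",
     "metadata_country": "Germany"},
    {"name": "org-3", "section": "Civic Hackers", "subsection": "Code for America"},
    {"name": "org-4", "section": "Governments", "subsection": "United Nations"},
]

countries = [assign_country(org) for org in organizations]
```

Because the rules are checked in order, an organization listed under a country subsection keeps that country even when its metadata point elsewhere.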
  14. 14. In addition, 11 organizations were removed from the list because, although they appear on the GitHub page, they are no longer available on the platform. 3.1.1 Organizations characteristics per country Figure 1 shows the number of GitHub organizations in the GitHub list for the top 16 countries by number of organizations. It can be seen that the United States has over 500 organizations, followed by the United Kingdom with just over 100 organizations. The number of organizations decreases rapidly down to Italy, in 16th place, with 9 organizations. Figure 1: Number of GitHub organizations per country for the top 16 countries per number of organizations. Figure 2 shows, for each country, the fraction of its organizations that already existed in a given year. The countries taken into consideration are the 16 countries of Figure 1. It can be seen that trends are similar for all countries, with exponential growth in the period 2012 - 2018. More than half of the countries reached 80% of their total number of organizations by the beginning of 2017.
  15. 15. Figure 2: Fraction of the number of existing GitHub organizations per country over the years. Figure 3 shows the average number of members and followers of the GitHub organizations of the 16 countries with the largest number of organizations. Interestingly, the highest average number of followers is obtained by Italian organizations, at more than 60. This is due to the organization Developers Italia, which registered 510 followers on 11 October 2022. Apart from Italy, the countries with the most followers on average are the United States, the United Kingdom and France, each with an average of more than 10 followers. For all countries, the average number of members of a GitHub organization is less than 10.
  16. 16. Figure 3: Average number of members and followers for the top 16 countries per number of organizations. Figure 4: Number of existing GitHub repositories per organization for the top 16 organizations per number of repositories. The organizations on the GitHub list with the largest number of repositories are now considered individually. Figure 4 presents, for each organization, the cumulative curve of the number of repositories by creation date. The organizations
  17. 17. considered are the 16 organizations with the largest number of repositories in the entire GitHub list. It can be seen that all the organizations in Figure 4, except for navikt and bcgov, are from the United States or the United Kingdom. The organization with the most repositories is navikt, from Norway, which has almost 1750 repositories. A total of 9 organizations have more than 1 000 repositories. 3.1.2 Repositories characteristics per country Figure 5: Total number of repositories of all organizations of the top 16 countries per number of organizations. Figure 5 shows the total number of repositories among all organizations in each country. It can be seen that the United States and the United Kingdom, the countries with the highest number of organizations, also have the highest number of repositories. In particular, the United States has a total of over 25K repositories, almost twice the number of repositories of the United Kingdom. Figure 6 shows the average number of repositories per country. The average is computed across the organizations of each country. The highest value is recorded by the United Kingdom, with an average of more than 140 repositories per organization. Right after the United Kingdom come Norway, Finland and Italy, with an average of more than 80 repositories per organization. It can be noted from Figure 3 that, for these 4 countries, the average number of members per organization is consistently less than 10. The most extreme case concerns the United Kingdom organizations: on average, they have more than 140 repositories and fewer than 3 members per organization managing them.
  18. 18. Figure 6: Average number of repositories of all organizations of the top 16 countries per number of organizations.
Given a repository, GitHub provides the amount of code (in kB) in the repository for each programming language. It then assigns to each repository the language that is most frequent in it.
Figure 7 presents, for the repositories of all organizations in each of the 16 countries already taken into account, the relative number of repositories (compared to the total number of repositories) labeled with the most frequent language. Python is the most frequent language in 6 countries, followed by JavaScript, which is the most frequent in 5 countries. For the repositories of Norway, Kotlin is the most frequent language, while for those of Sweden and Spain, PHP is the most frequent. On the other hand, for most of the repositories of the United States, GitHub does not specify any language; this is the case of repositories that do not contain code but contain other material, such as documentation or datasets.
Figure 7: Fraction of the total number of repositories labeled with the most popular language for the top 16 countries per number of organizations.
Among the cases in which no language is assigned to a repository, there is also the case in which a repository is empty. Figure 8 shows the relative number (compared to the total number of repositories) of empty repositories of all organizations in each of the 16 countries. It can be seen that for 7 countries,
  19. 19. the number of empty repositories is more than 2% of the total. The highest value is recorded by the repositories of Mexico, where more than 5% of them are empty.
Figure 8: Fraction of empty repositories with respect to the total number of repositories for the top 16 countries per number of organizations.
3.2 The case study: 4 datasets for Italy, Germany, the United States and the United Kingdom
This work considers a limited number of countries: Italy, Germany, the United States and the United Kingdom. Since the repositories of the United States and the United Kingdom amount to more than 28K and 15K respectively, these sets are not studied in their entirety. In particular, the following GitHub repositories are analysed:
• The repositories of all the 9 organizations with country = “Italy” (713 repositories).
• The repositories of all the 30 organizations with country = “Germany” (1 308 repositories).
• The repositories of the organization US General Services Administration, country = “United States” (937 repositories).
• The repositories of the organization Government Digital Service, country = “United Kingdom” (1 563 repositories).
From now on, these four datasets will be named using the abbreviations IT, DE, US and UK.
  20. 20. Figure 9: Fraction of GitHub repositories with respect to the total number of repositories per language for the 4 datasets, in panels (a) IT, (b) DE, (c) UK and (d) US. The languages taken into account are those for which, among all datasets, there are at least 30 repositories marked with that language.
The distributions of the programming languages associated with each repository for each dataset are summarised in Figure 9.
The figure shows what emerged in the previous section: the most used languages in the repositories of Italian and German organizations are Python and JavaScript respectively. In the US dataset, most of the repositories (more than 20%) do not contain any code, while in the UK dataset the most used language is Ruby.
  21. 21.
      Size (MB)               Stars
      Avg     Med    Max      Avg     Med   Max
  IT  47.52   0.72   4432     12.86   1     3929
  DE  19.22   0.59   1818     11.44   0     3352
  US  24.05   0.30   1511      9.46   1     1899
  UK  12.59   0.35   8299      7.16   1      768

      Watchers             Forks               Open Issues
      Avg     Med   Max    Avg    Med   Max    Avg    Med   Max
  IT   9.43    8    208    7.68   1     2297   5.75   1     383
  DE   5.37    4    208    3.11   0      601   4.00   0     403
  US  10.83    9    287    6.17   2      457   5.23   0     741
  UK  24.10   18    131    4.99   2      275   2.77   0     280

Table 1: Average, median and maximum of repositories size, number of stars, number of watchers, number of forks and number of open issues for the 4 datasets.
Table 1 shows the average, median and maximum values of the repository sizes and of the number of stars, watchers, forks and open issues for all repositories of the 4 datasets. The repositories of the IT dataset have the largest average size and the highest average number of stars. A watcher, for a GitHub repository, is a user that follows the repository to stay up to date on it. The UK dataset has on average the highest number of watchers, while on average the IT dataset has the highest number of open issues and forks. It can also be noted that the median values of the number of stars, forks and open issues are close to 0 and far from the average values for all datasets. This indicates that, for each of these parameters, the distribution is strongly right-skewed: most repositories have very low values, and a few repositories with very high values pull the average well above the median.
Regarding the number of repositories, Figure 10 displays the creation date and the date of last update for each repository of the 4 datasets. In detail, each segment of a graph represents a repository: it starts at the creation date and ends at the last update date of that repository. Segments are ordered from bottom to top by creation date. For each segment, the left extreme point represents the creation date of the corresponding repository. Now take the set of left extreme points of all the segments.
The curve formed by these points represents the cumulative number of repositories in the dataset over the years. These curves show that the number of repositories steadily increases in all the datasets; this suggests that the organizations of the 4 datasets have been active from 2013 to the present day.
  22. 22. Figure 10: For each dataset, in panels (a) IT, (b) DE, (c) UK and (d) US, a set of stacked segments where each segment represents a repository, starts at the creation date and ends at the last update date of the repository. Segments are ordered from bottom to top by creation date.
  23. 23. Although the number of repositories of organizations of each dataset seems to be growing steadily, not all repositories are constantly updated. In particular, the following fractions of repositories have been updated in the last year (since 11 October 2021):
• 50.35% of the repositories of the IT dataset.
• 86.54% of the repositories of the DE dataset.
• 83.03% of the repositories of the UK dataset.
• 42.50% of the repositories of the US dataset.
It can also be observed from Figure 10 that:
• Many of the IT repositories created between 2013 and 2015 became inactive almost immediately.
• A large portion of US repositories became inactive in early 2018 (a vertical line is almost visible just after 2018).
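The percentages above can be reproduced from the last update date of each repository. The following is a minimal sketch of that computation, not the code used in this work: the function name and the sample dates are hypothetical, and only the cutoff date (11 October 2021) comes from the text.

```python
from datetime import date


def fraction_updated_since(last_updates, cutoff):
    """Return the percentage of repositories last updated on or after `cutoff`.

    `last_updates` is a list of last-update dates, one per repository
    (hypothetical input format, chosen for illustration).
    """
    if not last_updates:
        return 0.0
    recent = sum(1 for d in last_updates if d >= cutoff)
    return 100.0 * recent / len(last_updates)


# Hypothetical sample: two of the four repositories were updated after the cutoff.
sample = [date(2022, 3, 1), date(2019, 5, 20), date(2021, 12, 24), date(2021, 10, 10)]
share = fraction_updated_since(sample, date(2021, 10, 11))
print(f"{share:.2f}% of the repositories were updated in the last year")
```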
  24. 24. 4 Methods and Analyses
4.1 SBoM creation and vulnerability detection
After obtaining the metadata for the organizations and repositories of the IT, DE, UK and US datasets, the following actions have been executed for the 4 datasets.
Step 1 - First, each repository of the 4 datasets is downloaded and the Microsoft sbom-tool [12] is executed on all the repositories. This tool takes the set of source files that compose a software artifact (e.g., a GitHub repository, a Python package and the like) as input and provides the corresponding software bill of materials (SBoM) in the SPDX 2.2 format as output. The constructed SBoM contains the list of dependencies of the repository. This list is constructed by the component-detection package scanning tool, also available on GitHub [16]. The programming language ecosystems supported are: CocoaPods, Conda, Gradle, Go, Maven, npm, Yarn, NuGet, PyPI, Poetry, Ruby, Cargo. Full details can be found in the tool documentation [17].
Content and structure of the list of dependencies constructed by [16] depend on the software ecosystems used by the artifact under analysis. Specifically:
• The set of dependencies is extracted from manifest files, whose names are expected to follow a predefined pattern specific to each supported ecosystem.
• For most supported ecosystems, transitive dependencies (i.e., dependencies of dependencies) are also extracted.
Dependencies take the form of a tuple (package manager, package name, package version). In a given repository, there could be multiple dependencies from a given (package manager, package name) pair. If those multiple dependencies result from manifest files in different folders, then sbom-tool includes one tuple for each different package version found. Otherwise, i.e., when the multiple dependencies result from the same manifest file, sbom-tool may include one or more tuples (each with a different package version) according to a set of ecosystem-specific rules.
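The dependency tuples described above can be read back from a generated SBoM. The sketch below is an illustration under stated assumptions, not the procedure used in this work: it assumes the SBoM is an SPDX JSON document already loaded as a dictionary, and that each package carries its package url in an `externalRefs` entry whose `referenceType` is `purl`. The function names `parse_purl` and `dependencies_from_spdx` are hypothetical, and the purl parser is deliberately simplified (it ignores namespaces as a separate field and does not validate the input beyond the `pkg:` prefix).

```python
from urllib.parse import unquote


def parse_purl(purl):
    """Split a package url into (package manager, package name, package version).

    Simplified parser: qualifiers (`?...`) and subpaths (`#...`) are dropped,
    and any namespace is kept as part of the package name.
    """
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a package url: {purl!r}")
    body = purl[len("pkg:"):].split("#", 1)[0].split("?", 1)[0]
    manager, _, rest = body.partition("/")
    name, separator, version = rest.rpartition("@")
    if not separator:  # no version present in the purl
        name, version = rest, None
    return manager, unquote(name), unquote(version) if version else None


def dependencies_from_spdx(document):
    """Collect the dependency tuples of all packages that expose a purl."""
    found = set()
    for package in document.get("packages", []):
        for reference in package.get("externalRefs", []):
            if reference.get("referenceType") == "purl":
                found.add(parse_purl(reference["referenceLocator"]))
    return found


# Hypothetical SBoM fragment with one Maven package and one package without a purl.
sbom = {
    "packages": [
        {
            "name": "spring-tx",
            "externalRefs": [
                {
                    "referenceType": "purl",
                    "referenceLocator": "pkg:maven/org.springframework/spring-tx@5.3.20",
                }
            ],
        },
        {"name": "root-package"},
    ]
}
print(dependencies_from_spdx(sbom))
```

Using a set makes repeated tuples collapse into one entry, which matches the idea that a dependency is identified by its (package manager, package name, package version) tuple.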
Step 2 - Second, an attempt is made to find in the analysed repositories further dependencies that may be missing from those obtained by sbom-tool at the previous step. This additional step is executed because preliminary analyses showed that manifest files might not be fully accurate in listing the dependencies that can indeed be found statically. Specifically, a further list of dependencies is constructed for each repository and kept separate from the one constructed at the previous step, by executing the following actions:
1. Execute the check-imports [18] package on each repository associated with the JavaScript language. This package is publicly available on GitHub and constructs dependencies for JavaScript and TypeScript source files based on static analysis.
  25. 25. 2. Execute the pipreqsnb tool [19] on each repository associated with the Python language. This tool is publicly available on GitHub and constructs dependencies for Python and Python notebook source files based on static analysis.
3. Remove from the dependencies obtained at steps 1 and 2 those already found by sbom-tool (independently of their versions).
4. For each Python repository, add the dependencies found at step 2 to the requirements.txt files, execute the sbom-tool again and retain the dependencies from (package-manager, package-name) pairs that had not been found by sbom-tool. This makes it possible to collect a further degree of transitive dependencies.
The analyses 1-4 are executed by skipping all files and folders whose names contain one of the following keywords: development, optional, enhances, suggests, build, configure, test, develop, dev, example, doc (in order to exclude any dependencies that are either necessary only for development or optional).
In summary, two lists of dependencies are constructed for each repository:
• manifest dependency: those obtained by the sbom-tool at Step 1. These dependencies may involve packages of any ecosystem supported by sbom-tool.
• parsed dependency: those obtained by check-imports, pipreqsnb and a further sbom-tool execution at Step 2. These dependencies may involve only JavaScript, TypeScript and Python packages.
The content of the two lists is disjoint by construction. No distinction is made between direct and transitive dependencies.
Then, each repository is analysed with the Grype tool available on GitHub [20]. This tool scans a set of source files in search of dependencies and lists the known vulnerabilities of the dependencies found. The input may also be specified as an SBoM, in which case the packages are obtained from the SBoM directly.
In detail, the following steps are executed:
• For each SBoM s found in step 1, analyse s with Grype.
• For each dependency p found in step 2, search the vulnerabilities of p in the Grype database.
Moreover, the first step was executed twice with different parameters, resulting in two different sets of vulnerabilities:
• grype: set of vulnerabilities obtained from the execution of Grype.
  26. 26. • grype cpe: set of vulnerabilities obtained from the execution of Grype using the add-cpes-if-none parameter. This parameter attempts to generate CPE information (CPE is a structured naming scheme for information technology systems [21]) if it is not present in the SBoM. In particular, this implies that if the CPE identifier for a dependency is present in the SBoM, then running Grype with or without the add-cpes-if-none parameter leads to the same set of vulnerabilities.
In addition, for all vulnerabilities collected with Grype, an attempt was made to obtain other metadata not provided by Grype, using one of the following:
• CVE API provided by the NIST National Vulnerability Database [22].
• Scraping of GitHub Advisory Database [23] website pages.
Finally, a different set of vulnerabilities called osv api is obtained using the API of OSV [24], a distributed vulnerability database for open source. More specifically, the API is queried for each tuple (package manager, package name, version) and the vulnerabilities found are stored separately from those obtained with Grype. Details on the vulnerability data sources used by Grype and OSV API can be found in the respective documentation.
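As an illustration of the per-tuple OSV query described above, the sketch below builds the JSON body of a query for one dependency tuple. It is a sketch under assumptions rather than the code used in this work: the helper name `osv_query_payload`, the lowercase package manager keys, and the choice of converting a Maven name of the form group/artifact into group:artifact are assumptions made for this example. The request itself (an HTTP POST to the OSV query endpoint, https://api.osv.dev/v1/query) is intentionally left out so that the example runs offline.

```python
import json

# Assumed mapping from package manager identifiers (as they appear in purls)
# to the ecosystem names used by OSV.
OSV_ECOSYSTEMS = {
    "npm": "npm",
    "pypi": "PyPI",
    "maven": "Maven",
    "gem": "RubyGems",
    "cargo": "crates.io",
    "golang": "Go",
    "nuget": "NuGet",
}


def osv_query_payload(manager, name, version):
    """Build the body of an OSV query for one (manager, name, version) tuple."""
    ecosystem = OSV_ECOSYSTEMS[manager]  # KeyError for unsupported managers
    if manager == "maven":
        # Assumption: names stored as "group/artifact" become "group:artifact".
        name = name.replace("/", ":")
    return {"version": version, "package": {"name": name, "ecosystem": ecosystem}}


payload = osv_query_payload("pypi", "numpy", "1.20.3")
print(json.dumps(payload, indent=2))
```

Keeping payload construction separate from the network call makes it easy to check the mapping for each ecosystem before any request is sent.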
  27. 27. 4.2 Database
Figure 11: Relational database schema realised for data storage.
All the data collected concerning GitHub organizations, dependencies and vulnerabilities are stored in a relational database. The schema of this database can be seen in Figure 11. The relational database tables are summarised below:
• organization. The table defines a GitHub organization. It contains the collected properties for that organization. The primary key url is the GitHub profile url.
  28. 28. • repository. The table defines a GitHub repository. It contains the collected properties for that repository. The primary key url is the url of the GitHub repository. The field organization is a foreign key that refers to the table organization.
• user. The table defines a GitHub user. It contains the username and creation date of the user’s profile. The primary key user name is the username of the GitHub user.
• contributor. The table defines the contribution relationship between a GitHub user and a repository. It contains the properties of this relationship, such as the number of contributions, the number of rejected pull requests and the maximum number of commits made on a certain day. The primary key is the (user name, repository) pair, where the user name field is a foreign key that refers to the user table and the repository field is a foreign key that refers to the repository table.
• package. The table defines the package entity as the tuple (package manager, package name, version). The primary key purl is the purl (package url) [25] of the package.
• manifest dependency. The table defines the dependency relationship between a repository and a package. This dependency relation belongs to the dependency list manifest dependency (defined in subsection 4.1). The primary key is the (repository, package) pair, where the repository field is a foreign key that refers to the repository table and the package field is a foreign key that refers to the package table.
• parsed dependency. The table defines the dependency relationship between a repository and a package. This dependency relation belongs to the dependency list parsed dependency (defined in subsection 4.1). The primary key is the (repository, package) pair, where the repository field is a foreign key that refers to the repository table and the package field is a foreign key that refers to the package table.
• vulnerability. The table defines a vulnerability.
It contains the primary properties collected for that vulnerability. The primary key id is the id of the vulnerability. The vulnerability id does not have a fixed structure but depends on the origin of the vulnerability. In detail, ids could start with:
– CVE: the vulnerability comes from one of the publicly available vulnerability data sources used by Grype and a CVE ID number has been assigned to it.
– GHSA: the vulnerability comes from the GitHub Advisory Database [23]. In particular, the vulnerabilities found that come from GitHub are GitHub-reviewed advisories. These advisories are vulnerabilities that have been mapped to packages in ecosystems that GitHub supports (further information can be found in the GHSA documentation [26]).
  29. 29. – GO: the vulnerability comes from the Go Vulnerability Database [27].
– OSV: the vulnerability comes from the OSS-Fuzz Database [28].
– PYSEC: the vulnerability comes from the PyPI Advisory Database [29].
– RUSTSEC: the vulnerability comes from the Rust Advisory Database [30].
• vulnerability metadata. The table defines metadata of a vulnerability. It contains all the properties collected for that vulnerability. The primary key id is a foreign key that refers to the vulnerability table.
• grype potential affection. The table defines the affection relationship between a package and a vulnerability. This affection relation belongs to the grype set of vulnerabilities (defined in subsection 4.1). The primary key is the (vulnerability, package) pair, where the vulnerability field is a foreign key that refers to the vulnerability table and the package field is a foreign key that refers to the package table.
• grype cpe potential affection. The table defines the affection relationship between a package and a vulnerability. This affection relation belongs to the grype cpe set of vulnerabilities (defined in subsection 4.1). The primary key is the (vulnerability, package) pair, where the vulnerability field is a foreign key that refers to the vulnerability table and the package field is a foreign key that refers to the package table.
• osv api potential affection. The table defines the affection relationship between a package and a vulnerability. This affection relation belongs to the osv api set of vulnerabilities (defined in subsection 4.1). The primary key is the (vulnerability, package) pair, where the vulnerability field is a foreign key that refers to the vulnerability table and the package field is a foreign key that refers to the package table.
The defined structure makes it possible to handle the fact that a vulnerability may be associated with more than one (package manager, package name, package version) tuple.
In detail, the tables vulnerability and package represent a single vulnerability and a single (package name, package version) pair respectively. The potential affection tables, on the other hand, represent the relationship between a vulnerability and a pair, where a vulnerability may affect several pairs. Take as an example the case of a vulnerability $v$ collected with Grype (without the add-cpes-if-none parameter) for different versions of a package $x$. In this case:
• Assume that, after the execution of the sbom-tool over several repositories, a record (package manager, package name, package version) for each version $x_1, \dots, x_n$ of $x$ has been inserted in the table package. Let these records be called $\tilde{x}_1, \dots, \tilde{x}_n$.
• Assume that, after the execution of Grype over the SBoMs built in the previous step, the vulnerability $v$ was found with Grype for all the pairs $\tilde{x}_1, \dots, \tilde{x}_n$. A record for the vulnerability is inserted in the table vulnerability. Let this record be called $\tilde{v}$.
  30. 30. • A record is inserted in the grype potential affection table for each pair $(\tilde{v}, \tilde{x}_i)$, for each $i \in \{1, \dots, n\}$.

  Table                            Records
  organization                       1 315
  repository                        75 407
  contributor                       33 718
  user                              13 685
  package                           62 758
  vulnerability                      3 762
  manifest dependency              559 586
  parsed dependency                  6 649
  grype cpe potential affection     15 300
  osv api potential affection       11 209
  grype potential affection          9 989
  vulnerability metadata             3 633

Table 2: Cardinalities of the tables of the database.
Table 2 shows the cardinality of the database tables. While the organization and repository tables contain entries from all GitHub organizations in the GitHub and Government list (see subsection 3.1), all other tables contain data from the IT, DE, UK, US datasets only.
4.3 Creation of SBoMs: critical issues
4.3.1 Conditional dependencies
Dependencies for a software artifact often have to be specified in a conditional way, that is, depending on specific properties of the target environment in which the artifact will be built or executed. The rules for specifying such conditional dependencies usually depend on the ecosystem. Two examples related to the PyPI and Maven ecosystems are provided below.
Example 1 The requirements.txt dependencies file for PyPI (Python), as per PEP 508 [31], allows the following notation for conditional dependencies:
requirements.txt
    numpy==1.20.3; python_version<"3.6" and sys_platform!="linux"
In this case, version 1.20.3 of the numpy package is required only if the environment is not Linux and the version of Python used is lower than 3.6.
Example 2 The specification of the Project Object Model pom.xml file for Maven (Java) [32] defines build profiles to handle equivalent but different parameters for a set of target environments. Below is an example of a part of a pom.xml file that uses build profiles.
pom.xml
    <dependencies>
      <dependency>
  31. 31.
        <groupId>com.google.cloud</groupId>
        <artifactId>google-cloud-logging</artifactId>
        <version>3.12.1</version>
      </dependency>
    </dependencies>
    <profiles>
      <profile>
        <id>profile_1</id>
        <dependencies>
          <dependency>
            <groupId>com.sun.jersey</groupId>
            <artifactId>jersey-json</artifactId>
            <version>1.19.4</version>
          </dependency>
        </dependencies>
      </profile>
      <profile>
        <id>profile_2</id>
        <dependencies>
          <dependency>
            <groupId>io.spray</groupId>
            <artifactId>spray-json_3</artifactId>
            <version>1.3.6</version>
          </dependency>
        </dependencies>
      </profile>
    </profiles>
In this case, a global dependency (google-cloud-logging) and two build profiles are defined in the pom.xml file. Depending on the profile used, the required dependency will either be jersey-json (profile_1) or spray-json_3 (profile_2).
When handling conditional dependencies, sbom-tool (more precisely, the component detection tool) adopts a predefined behavior that depends on the ecosystem and, most importantly, never records the conditions themselves in the generated SBoM. With reference to the above examples:
• In handling the requirements.txt file of Example 1, sbom-tool ignores the dependency conditions and adds the numpy package to the dependencies, irrespective of the current Python version or the current system platform.
• In handling the pom.xml file of Example 2, sbom-tool ignores the presence of build profiles, adding only the google-cloud-logging package to the dependencies.
The fact that the generated SBoM does not describe conditional dependencies could be a limitation of the SBoM standard. As per the NTIA guidelines [33], an SBoM could be created for each software component at the moment it is built, packaged or delivered. This implies that when software is delivered as source code, it may have an SBoM. Moreover, if it is delivered as source code, it may have conditional dependencies, which have to be stored in the SBoM.
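The distinction between global and profile-specific dependencies in Example 2 can be made explicit with a short script. The following is a minimal sketch using only the Python standard library, not a description of how sbom-tool or component-detection work. It assumes that the fragment of Example 2 is wrapped in a `<project>` root element and, unlike a real pom.xml, declares no XML namespace; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The fragment of Example 2, wrapped in a root element (assumption: no namespace).
POM = """
<project>
  <dependencies>
    <dependency>
      <groupId>com.google.cloud</groupId>
      <artifactId>google-cloud-logging</artifactId>
      <version>3.12.1</version>
    </dependency>
  </dependencies>
  <profiles>
    <profile>
      <id>profile_1</id>
      <dependencies>
        <dependency>
          <groupId>com.sun.jersey</groupId>
          <artifactId>jersey-json</artifactId>
          <version>1.19.4</version>
        </dependency>
      </dependencies>
    </profile>
    <profile>
      <id>profile_2</id>
      <dependencies>
        <dependency>
          <groupId>io.spray</groupId>
          <artifactId>spray-json_3</artifactId>
          <version>1.3.6</version>
        </dependency>
      </dependencies>
    </profile>
  </profiles>
</project>
"""


def read_dependencies(parent):
    """Return (groupId, artifactId, version) for the direct children of `parent`."""
    return [
        (dep.findtext("groupId"), dep.findtext("artifactId"), dep.findtext("version"))
        for dep in parent.findall("./dependencies/dependency")
    ]


def split_dependencies(pom_text):
    """Separate unconditional dependencies from those declared inside build profiles."""
    root = ET.fromstring(pom_text)
    unconditional = read_dependencies(root)
    conditional = {
        profile.findtext("id"): read_dependencies(profile)
        for profile in root.findall("./profiles/profile")
    }
    return unconditional, conditional


unconditional, conditional = split_dependencies(POM)
print("always required:", unconditional)
print("per profile:", conditional)
```

A tool that keeps only the first list reproduces the behaviour described above for Example 2, while the second dictionary holds exactly the information that is lost.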
  32. 32. It is not clear how information on conditional dependencies should be stored when creating SBoMs that follow standards such as CycloneDX and SPDX. More precisely, to the best of the author’s understanding:
• The CycloneDX standard has no specific attributes for the insertion of conditional dependencies. This information can be inserted into the SBoM only as a non-machine-readable comment.
• The SPDX 2.2 standard has no specific attributes for the insertion of conditional dependencies. This information can be inserted into the SBoM only as a non-machine-readable comment. Moreover, according to the SPDX 2.2 documentation [34], it is possible to define relationships between the software and its dependencies. The closest relationship type that can represent conditional dependencies is that of optional dependencies. Dependencies are optional only when building the code will proceed even without them. The description of the relationship type OPTIONAL_DEPENDENCY_OF says:
Use when building the code will proceed even if a dependency cannot be found, fails to install, or is only installed on a specific platform.
It is noted, however, that although some conditional dependencies may also be optional, conditional and optional dependencies do not represent the same concept.
It is also noted that the SPDX documentation concerning the attribute “External reference comment field” differs slightly between the web [34] and pdf [35] versions. In particular, the pdf version uses the term “conditional” to describe the field Cardinality, while it would be more correct to use the term “optional”.
4.3.2 Version constraints
The structure and stringency of the files that store dependencies depend on the ecosystem. For example, the requirements.txt file for PyPI allows a dependency to be added without specifying its version.
In this case, the dependency is collected by the sbom-tool using the latest version released on PyPI as the version (this is also the case when transitive dependencies are obtained by querying PyPI’s index). In most other cases, the specifications on the structure of manifest files are much more stringent. In particular, manifest files can be divided into two groups:
• Files that allow dependencies to be specified without fixing a version, using specific syntaxes (e.g. >=, caret, tilde) to define version ranges based on the semantic versioning system.
• Files that require the version of each dependency to be specified; these include the lock file category: files that contain a snapshot of all the
  33. 33. dependencies used (both direct and transitive) with their versions. Dependencies are said to be locked.
In general, the SBoM concept requires dependencies to be locked. Given that, a manifest file with dependencies of the form >= could lead to an SBoM where the versions of the dependencies are the latest versions released at the time the SBoM was built. However, this may not be the appropriate behaviour: it would be more meaningful to store the latest version available at the time of the last update of the software artifact (e.g. the last commit of a GitHub repository or the date of last publication for a package). Consider the following example concerning the creation of an SBoM for a GitHub repository.
Example 1 Repository X is created on 17 May 2022. It contains some code files and the requirements.txt file, with the following content:
requirements.txt
    tensorflow-gpu >= 2.9.0
The author uses the tensorflow-gpu package with the latest version released, 2.9.0 (released on 16 May 2022). The author no longer updates the repository. The sbom-tool tool is executed on 7 September 2022. The tensorflow-gpu package is collected with the latest version released, 2.10.0 (released on 6 September 2022). In this case, it is not possible to find the CVE-2022-36011 vulnerability, which affects version 2.9.0 but not 2.10.0 of the package tensorflow-gpu.
In this case, what is missing is a standardized procedure for constructing SBoMs for a given GitHub repository.
4.4 Vulnerabilities collection: critical issues
4.4.1 Grype and Java vulnerabilities
A first problem noted when using Grype was the non-identification of vulnerabilities for Java packages. Although such vulnerabilities exist, none related to the Java packages reported in the SBoMs built with sbom-tool were found when running Grype. This could be a compatibility problem between the SPDX SBoMs generated with sbom-tool and Grype.
In fact, the Grype documentation states:
Grype supports input of Syft, SPDX, and CycloneDX SBoM formats. If Syft has generated any of these file types, they should have the appropriate information to work properly with Grype. It is also possible to use SBoMs generated by other tools with varying degrees of success.
  34. 34. Interestingly, Java packages are the only packages collected for which all package names are in the form string/identifier, where string is the reverse domain string of the organization that authored the package. For example:
package name = org.springframework/spring-tx
Although Grype does not seem to be able to find Java vulnerabilities from SBoMs constructed by sbom-tool, many Java vulnerability matches were found by manually querying the Grype database using the full package names. Given that, it is assumed that Grype is not able to extract the full name of a Java package from the purl (package url) stored in an SBoM built by sbom-tool.
For the grype and grype cpe vulnerability sets (see subsection 4.1), it is therefore decided to add the vulnerabilities related to Java packages that are present in the database used by Grype, without the direct use of Grype. In detail, the following steps are executed:
• Obtain the list l of all Java packages stored in the table package of the database described in subsection 4.2.
• For each (package name, package version) Java pair p = (pn, pv) ∈ l, search in the Grype database for Java vulnerabilities related to p. For each vulnerability v found:
– Store v in the table vulnerability of the database described in subsection 4.2.
– Store the affection relation between vulnerability v and package p in both tables grype potential affection and grype cpe potential affection of the database described in subsection 4.2.
4.4.2 Grype false positive
Another problem encountered when using Grype concerns the use of the add-cpes-if-none parameter. As described in subsection 4.1, the grype cpe vulnerability set is obtained by using Grype with that parameter. Below is the description of the add-cpes-if-none parameter in the Grype documentation:
Two things that make Grype matching more successful are inclusion of CPE and Linux distribution information. If an SBoM does not include any CPE information, it is possible to generate these based on package information using the add-cpes-if-none flag.
The problem with Grype matching with add-cpes-if-none is that several erroneous vulnerability-package associations were found. In these false positives, the pattern is that a certain vulnerability related to a generic software artifact with name X from ecosystem Y1 is associated with a package with the same name X but from ecosystem Y2. Some examples are reported below.
Example 1 Grype associates vulnerability CVE-2020-7791 with RubyGems
  35. 35. package i18n (Ruby) when the vulnerability is actually associated with the package i18n for ASP.NET.
Example 2 Grype associates vulnerability CVE-2020-18032 with PyPI package graphviz (Python) when the vulnerability is actually associated with the software Graphviz Graph Visualization Tools.
Example 3 Grype associates vulnerability CVE-2020-15133 with npm package faye-websocket (JavaScript) when the vulnerability is actually associated with the RubyGems package faye-websocket (Ruby).
Taking this problem into account (and assuming there may be other false positives), the author of this work decided to use a limited set of vulnerabilities for the analysis in the next section. In detail, only the vulnerabilities present in all 3 vulnerability sets (grype, grype cpe and osv api) are used.
Figure 12: Venn diagrams specifying how many vulnerabilities (a) and affections (b) are present in which set.
Figure 12 shows the Venn diagrams of the number of vulnerabilities and affections (vulnerability-dependency pairs) for the 3 sets of vulnerabilities. The number of vulnerabilities that will be considered later is therefore 1 778. These 1 778 vulnerabilities produce a total of 9 020 affections.
4.5 Dependency analyses
The analyses of the dependencies found are now presented. As specified in subsection 4.1, dependencies are collected for the 4 datasets IT, DE, UK and US described in subsection 3.2.
  36. 36. 4.5.1 Packages and dependencies: Viewpoints of package managers, datasets and repositories
Figure 13: Distribution of (package name, package version) pairs among the various package managers for datasets (a) IT, (b) DE, (c) UK and (d) US.
Figure 13 shows, for each dataset, the distribution of all the (package name, package version) pairs found among the various package managers. Although JavaScript is the most common language only for the repositories of dataset DE (see Figure 9), in all datasets the vast majority of packages come from npm. This large number of JavaScript packages is not surprising: JavaScript projects are known for their large number of dependencies. This is mainly due to the absence of a standard library and the related intensive use of third-party packages by developers. This has created a very large ecosystem: npm has more registered packages than Maven, PyPI and RubyGems put together [36].
Moreover, the Node.js dependency management mechanism allows different versions of the same package to coexist within the same project, increasing the number of possible dependencies.

(e) IT (f) DE (g) UK (h) US
Figure 14: Distribution of (package name, package version, repository) tuples among the various package managers for datasets IT, DE, UK and US.

Figure 14 shows, for each dataset, the distribution of all the (package name, package version, repository) tuples found among the various package managers. Here, npm is even more prominent: npm packages are on average used by more repositories of the same dataset than packages from any other package manager. Interestingly, the UK dataset seems to be the one with the lowest percentage of npm dependencies compared to total dependencies. This is due to the large number of RubyGems dependencies present; Ruby is in fact the most popular language in the UK dataset.

Furthermore, the following list reports how all the dependencies, i.e. the (repository, package)
pairs found, are distributed among the datasets:

• 13.04% of dependencies found come from dataset IT (an average of 103 dependencies per repository).
• 40.48% of dependencies found come from dataset DE (an average of 176 dependencies per repository).
• 30.78% of dependencies found come from dataset UK (an average of 112 dependencies per repository).
• 15.69% of dependencies found come from dataset US (an average of 95 dependencies per repository).

(a)
Repos. lang.    IT    DE    UK    US
Python          14     6    14     9
JavaScript      12    18    26    27
Ruby             1     0    60     0
Go               0     1    18     2
Java             2     5     8     1

(b)
Repos. lang.    IT    DE    UK    US
Python          12     6    14     6
JavaScript      10    10    12    23
Ruby             1     0    59     0
Go               0     0     9     0
Java             2     1     2     1

Table 3: Cross-language dependencies: for each language, number of repositories with at least one dependency of a different language (Table (a)) and with at least one dependency of that language and one of a different language (Table (b)).

A cross-language dependency for a repository is defined as a dependency of a language different from that of the repository. Table 3 shows the number of repositories with at least one cross-language dependency (3-a) and with at least one cross-language dependency and one dependency of the repository's language (3-b).

In some of the subsequent dependency analyses, the repositories will be divided by language; this division will not take into account the cross-language dependencies of these repositories. However, from 3-a, it can be seen that the repositories with at least one cross-language dependency are few compared to the cardinalities of the datasets. Finally, it can be seen that the data in 3-b are very similar to those in 3-a, indicating that when a repository has cross-language dependencies, it often also has dependencies of its own language.
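The counting behind Table 3 can be illustrated with a minimal sketch. It assumes that each dependency is available as a (repository, package manager) pair and that every package manager maps to one language; the mapping `MANAGER_LANGUAGE` and the function name are hypothetical and are not taken from the software developed for this work.

```python
from collections import defaultdict

# Hypothetical mapping from package manager to language (an assumption).
MANAGER_LANGUAGE = {
    "npm": "JavaScript",
    "pypi": "Python",
    "gem": "Ruby",
    "maven": "Java",
    "golang": "Go",
}


def cross_language_counts(repo_language, dependencies):
    """Count repositories per language as in Table 3 (a) and (b).

    repo_language: dict mapping repository -> main language.
    dependencies: iterable of (repository, package_manager) pairs.
    """
    # Collect the set of dependency languages seen for each repository.
    dep_langs = defaultdict(set)
    for repo, manager in dependencies:
        lang = MANAGER_LANGUAGE.get(manager)
        if lang is not None:
            dep_langs[repo].add(lang)

    table_a = defaultdict(int)  # at least one cross-language dependency
    table_b = defaultdict(int)  # cross-language AND own-language dependency
    for repo, langs in dep_langs.items():
        own = repo_language.get(repo)
        if own is None:
            continue
        if any(lang != own for lang in langs):
            table_a[own] += 1
            if own in langs:
                table_b[own] += 1
    return dict(table_a), dict(table_b)


# Toy data: r1 and r2 are Python repositories with npm dependencies,
# but only r1 also has a PyPI dependency.
example_a, example_b = cross_language_counts(
    {"r1": "Python", "r2": "Python", "r3": "Ruby"},
    [("r1", "npm"), ("r1", "pypi"), ("r2", "npm"), ("r3", "gem")],
)
```

In this toy example both Python repositories count towards Table (a), while only r1 counts towards Table (b).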
Figure 15: On the y-axis, the number of dependencies. On the x-axis, the percentage of repositories of a target language that have at least y dependencies. Languages taken into account are JavaScript, Java, Python, Ruby, Go.

For each target language, Figure 15 plots the number of dependencies coming from the package manager of that language against the percentage of repositories of that language that have at least that number of dependencies. The languages taken into account are the 5 languages with the highest number of dependencies found across the 4 datasets. For example, the point (x,y) on the Ruby curve says that x% of Ruby repositories have at least y Ruby dependencies.

Once again, the anomalous behaviour of JavaScript with respect to the other languages can be seen. In particular, almost 20% of the JavaScript repositories have more than 1 000 dependencies. Python, on the other hand, seems to be the language with the lowest number
of dependencies per repository, with only about 2% of Python repositories having more than 100 dependencies.

4.5.2 Critical dependencies

Figure 16 shows, for each dataset, the top 30 (package name, package version) pairs by percentage of repositories that depend on them.

The first thing that can be seen is that almost all packages are JavaScript packages for all 4 datasets. Since information about dependency trees is not collected, it is not possible to distinguish between direct and transitive dependencies in this study. However, it can be assumed that many of these packages are present as transitive dependencies. This assumption is motivated by the data reported in Table 4.

(a) IT (b) DE
(c) UK (d) US
Figure 16: Top 30 (package name, package version) pairs by percentage of repositories that depend on them for the datasets IT, DE, UK, US.

Table 4 reports a set of properties for the top 10 JavaScript (package name, package version) pairs by number of repositories that depend on them. The repositories taken into account are those of all 4 datasets. In fact, it can be seen that most of the packages in the top 10 of Table 4 are also present in the top 30 of each dataset (Figure 16). In detail, the properties specified by Table 4 are: the percentage of repositories that depend on the package, the number of dependent packages, the number of dependencies and the latest existing version.
Name                         Repos. (%)  Dependent  Dependencies  L. Version
wrappy 1.0.2                   10.40       1 159         0          1.0.2
once 1.4.0                     10.35       2 175         1          1.4.0
concat-map 0.0.1               10.26       1 149         0          0.0.2
inflight 1.0.6                 10.22       1 124         2          1.0.6
path-is-absolute 1.0.1         10.20       1 513         0          2.0.0
fs.realpath 1.0.0              10.18       1 109         0          1.0.0
util-deprecate 1.0.2            9.58         947         0          1.0.2
escape-string-regexp 1.0.5      9.49       3 217         0          5.0.0
isexe 2.0.0                     9.40         971         0          2.0.0
isarray 1.0.0                   9.29       1 060         0          2.0.5

Table 4: Top 10 JavaScript (package name, package version) pairs by number of repositories that depend on them. The repositories taken into account are those of all 4 datasets. For each package, the table reports the percentage of repositories that depend on it, the number of dependent packages, the number of dependencies and the latest existing version.

From Table 4 it can be noted that 4 out of 10 packages are not updated to the latest version. Among the 10 packages in the table, one, path-is-absolute, is deprecated. As can be seen from Figure 16, the deprecated package path-is-absolute is used by 8.41%, 18.48%, 11.11% and 9.61% of all the repositories of the datasets IT, DE, UK and US respectively.

4.5.3 Manifest vs Parsed dependencies

           JavaScript              Python                  Ruby    Java    Go
Dataset    man.    par.    tot.    man.    par.    tot.    man.    man.    man.
IT          5.93    3.14    7.33   10.77   16.84   22.56    0.99   10.42    4.26
DE         27.4     9.95   30.89   11.62   12.76   17.51    0.79   24.31    2.13
UK         22.34    6.28   24.26   20.71   16.16   25.08   54.96    9.72   36.88
US         12.74    7.16   15.18    6.73    6.9    10.77    5.75    0.69   11.35

Table 5: Percentages (%) of JavaScript, Python, Ruby, Java and Go repositories for which at least one manifest dependency (man), parsed dependency (par) and any dependency (tot) is found.

Table 5 presents the percentages of JavaScript, Python, Ruby, Java and Go repositories for which at least one dependency is found. Due to the way dependencies are collected (see subsection 4.1), the database constructed contains parsed dependencies only for JavaScript and Python.
A repository that contains code may have no manifest dependencies for one of the following reasons:

• No third-party dependencies are used.
• The third-party dependencies used are not specified in the dependency files.
• Dependency files are not present or are malformed.
Table 5 shows that, for JavaScript repositories, the highest percentage of repositories with at least one dependency found in the dependency files is obtained by the DE dataset, with more than 27% of JavaScript repositories for which at least one dependency is found. The DE dataset also has the highest percentage of JavaScript repositories for which at least one parsed dependency is found. For the IT dataset, on the other hand, at least one manifest dependency is found for less than 6% of the JavaScript repositories.

The Python repositories in the IT dataset seem to have the highest percentage of repositories with dependencies not specified in the dependency files (this is the case for more than 16% of the Python repositories in the dataset).

Finally, it is noted that the UK dataset has the highest percentage of Ruby and Go repositories for which at least one dependency was found: 54.96% and 36.88% of Ruby and Go repositories respectively.

In addition, looking at the data in both Figure 16 and Figure 9, one can see that:

• 10.52% of the repositories in the IT dataset are JavaScript or TypeScript repositories. More than 8% of the repositories in the IT dataset have at least one npm dependency.
• 23.92% of the repositories in the DE dataset are JavaScript or TypeScript repositories. More than 18% of the repositories in the DE dataset have at least one npm dependency.
• 12.01% of the repositories in the UK dataset are JavaScript or TypeScript repositories. More than 11% of the repositories in the UK dataset have at least one npm dependency.
• 12.80% of the repositories in the US dataset are JavaScript or TypeScript repositories. More than 9% of the repositories in the US dataset have at least one npm dependency.

However, Table 5 shows that for the majority of the JavaScript repositories of all the datasets, no dependencies are found: this indicates that many of the repositories that depend on packages in Figure 13 are not JavaScript repositories.
Finally, the data in Table 6 are reported. For each package manager, Table 6 gives, among the repositories that have at least one parsed dependency from that package manager, the percentage that have no manifest dependencies from it. These data are intended to answer the question: in cases where the developers fail to specify at least one dependency, do they omit just some of them, or do they omit all of them? The results show that Italian developers are the ones who most often omit all of them: for almost 70% of the repositories in the IT dataset for which a parsed npm/pypi dependency was found, no npm/pypi dependencies are specified in the manifest files.
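The ratio described above can be sketched as follows, assuming that for a given dataset and package manager the repositories with at least one parsed dependency and those with at least one manifest dependency are available as two sets. The function name and the toy repository names are hypothetical.

```python
def pct_without_manifest(parsed_repos, manifest_repos):
    """Percentage of repositories with at least one parsed dependency
    that have no manifest dependency (same package manager)."""
    parsed = set(parsed_repos)
    if not parsed:
        return 0.0
    # Repositories whose parsed dependencies are never declared in a manifest.
    missing = parsed - set(manifest_repos)
    return 100.0 * len(missing) / len(parsed)


# Toy data: 4 repositories with parsed npm dependencies, only one of which
# also declares npm dependencies in a manifest file.
example_pct = pct_without_manifest({"a", "b", "c", "d"}, {"a", "x"})
```

Repositories that appear only in the manifest set (such as "x" in the toy data) do not affect the result, since the percentage is computed over the repositories with parsed dependencies.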
Package Manager    IT       DE       UK       US
npm                68.75    43.10    50.00    42.31
pypi               69.81    47.37    30.21    63.41

Table 6: Percentages of repositories that have no npm/pypi manifest dependencies with respect to repositories that have at least one npm/pypi parsed dependency.

4.6 Vulnerability analyses

The vulnerabilities found for all repositories of the four datasets IT, DE, UK and US are now analysed.

In this work, a vulnerability associated with a repository is defined as a vulnerability associated with a package used by that repository. The fact that a package is vulnerable does not imply that a repository using that package is vulnerable: the repository may not use the vulnerable units of the package, and the way the package is used may make it impossible to exploit the vulnerability. The existence of a vulnerability associated with a repository is therefore said to make the repository potentially vulnerable.

A total of 3 792 vulnerabilities were collected. Among these, 3 576 have a CVE identifier (see subsection 4.2). The CVE identifier of a publicly known vulnerability is the identifier published for that vulnerability by the CVE (Common Vulnerabilities and Exposures) system. Entities that can publish a CVE record are called CVE Numbering Authorities (CNAs).

Severity is measured with the CVSS (Common Vulnerability Scoring System), an open set of standards maintained by FIRST and used by many organizations, such as NVD, IBM or Oracle, to assign a severity measure to a vulnerability. CVSS assigns a severity score from 0 to 10 to a given vulnerability. This score takes into account parameters such as the vulnerability's exploitability, the vulnerability's scope and the impact of an attack in case the vulnerability is successfully exploited. More information can be found in the documentation [37].
Depending on the assigned score, a vulnerability is categorized into one of the following severity classes:

• Low: severity in range 0.1-3.9.
• Medium: severity in range 4.0-6.9.
• High: severity in range 7.0-8.9.
• Critical: severity in range 9.0-10.0.

All vulnerabilities taken into account that do not have a CVE identifier are GitHub GHSA vulnerabilities (see subsection 4.2). GitHub is a CNA, and usually provides CVSS base metrics for vulnerabilities in its database, even if a CVE identifier has not yet been assigned to them. In detail, all collected GitHub GHSA vulnerabilities have a severity class.
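The mapping from score to severity class listed above can be written as a small function. This is only an illustrative sketch: the function name is hypothetical, and the "None" class for a score of 0.0 follows the CVSS v3 qualitative rating scale rather than anything stated in this work.

```python
def severity_class(score):
    """Map a CVSS base score (0.0-10.0) to its severity class."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS scores range from 0.0 to 10.0")
    if score == 0.0:
        return "None"      # CVSS v3 rating for a score of 0.0 (assumption)
    if score < 4.0:
        return "Low"       # 0.1-3.9
    if score < 7.0:
        return "Medium"    # 4.0-6.9
    if score < 9.0:
        return "High"      # 7.0-8.9
    return "Critical"      # 9.0-10.0


# A repository is counted in Figure 18 when at least one associated
# vulnerability falls in the "High" or "Critical" class, i.e. score >= 7.0.
example_classes = [severity_class(s) for s in (3.9, 4.0, 7.0, 9.0)]
```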
Figure 17 presents the distribution of severities among all the vulnerabilities taken into account. It can be seen that more than 40% of vulnerabilities have high severity. The second most frequent severity class is medium, followed by critical and low.

Figure 17: Distribution of severities among all the vulnerabilities found for the 4 datasets IT, DE, UK and US.

4.6.1 Viewpoint of repositories

The vulnerabilities found are now analysed from the viewpoint of the repositories.

Figure 18: Percentage of repositories that have at least one high or critical vulnerability for datasets IT, DE, UK, US.

Figure 18 shows the percentage of repositories associated with at least one high or critical vulnerability for the 4 datasets. It can be seen that the UK dataset has the highest percentage: almost 40% of the repositories in the dataset use a package for which a vulnerability with a score of 7 or higher is found.

Figure 19 shows the distributions of the number of critical and high vulnerabilities found in each repository; percentage values are computed with respect to the number of repositories with at least one high or critical vulnerability (see Figure 18).

It can be seen that for all datasets, most of the repositories are in the range
(0,2) and (0,4) critical and high vulnerabilities, respectively. By focusing on the ranges with more than 11 critical and more than 44 high vulnerabilities (cells in the upper right region of each map), it turns out that IT has the highest percentage of repositories in such ranges: 22.68% of the repositories taken into account, as opposed to 10.12%, 7.68% and 4.68% for datasets DE, UK and US respectively. Moreover, 1.68% of the Italian repositories taken into account are associated with 18 or more critical vulnerabilities and 70 or more high vulnerabilities (the cell in the upper right corner of the map).

(a) IT (b) DE
(c) UK (d) US
Figure 19: Heat maps of the number of high and critical vulnerabilities. The value in each square is the percentage of repositories associated with a number of high and critical vulnerabilities in those ranges. Percentage values are computed only over the repositories associated with at least one high or critical vulnerability (see Figure 18).

An unresolved vulnerability v is defined as a vulnerability for which no version of the vulnerable package without v has been released.

Now only the subset of unresolved vulnerabilities (at the data collection date, October 11th, 2022) is analysed. Among all the vulnerabilities taken into account, only 14.46% are marked as unresolved.

Figure 20 shows the percentage of repositories associated with at least one
unresolved high or critical vulnerability for the 4 datasets. It can be seen that, as in Figure 18, the UK dataset has the highest percentage. The results in Figure 20 and Figure 18 differ by a few percentage points for all datasets, indicating that most repositories associated with at least one critical or high vulnerability are also associated with at least one unresolved critical or high vulnerability.

Figure 20: Percentage of repositories associated with at least one unresolved high or critical vulnerability for datasets IT, DE, UK, US.

Figure 21 shows the distributions of the number of unresolved critical and high vulnerabilities found in each repository; percentage values are computed with respect to the number of repositories with at least one unresolved high or critical vulnerability (see Figure 20).

It can be seen that for all datasets, more than 60% of the repositories are in the range (0,2) and (0,4) unresolved critical and high vulnerabilities, respectively.

Only the DE dataset has repositories associated with more than 5 unresolved critical vulnerabilities: for 1.44% of the repositories, between 6 and 8 unresolved critical vulnerabilities and between 10 and 14 unresolved high vulnerabilities are found.

(a) IT
(b) DE (c) UK
(d) US
Figure 21: Heat maps of the number of unresolved high and critical vulnerabilities. The value in each square is the percentage of repositories that are associated with a number of unresolved high and critical vulnerabilities in those ranges. Percentage values are computed only over the repositories associated with at least one unresolved high or critical vulnerability (see Figure 20).

The distribution of critical or high vulnerabilities among repositories is then analysed. Figure 22 shows the number of critical or high vulnerabilities against the percentage of repositories with at least that number of vulnerabilities. Percentages are computed over the total number of repositories of each dataset.

As the percentage of repositories decreases, the number of vulnerabilities increases exponentially for each dataset. One can also see that while for the IT, DE and UK datasets the percentage of repositories associated with more than 40 vulnerabilities is about 4%, for the US dataset it is less than 2%.
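Each point of such a curve is a simple "at least n" count. A minimal sketch, assuming the number of critical or high vulnerabilities of every repository of a dataset (zeros included) is available as a list; the function name and the toy data are hypothetical:

```python
def pct_repos_with_at_least(vuln_counts, threshold):
    """Percentage of repositories with at least `threshold` vulnerabilities.

    vuln_counts: one entry per repository of the dataset, including
    repositories with zero critical or high vulnerabilities.
    """
    if not vuln_counts:
        return 0.0
    hits = sum(1 for count in vuln_counts if count >= threshold)
    return 100.0 * hits / len(vuln_counts)


# Toy dataset of 5 repositories.
counts = [0, 0, 1, 5, 50]
example_at_least_1 = pct_repos_with_at_least(counts, 1)    # 3 of 5 repositories
example_more_than_40 = pct_repos_with_at_least(counts, 41)  # 1 of 5 repositories
```

Because the denominator is the whole dataset, repositories without vulnerabilities lower every point of the curve, which is what distinguishes these percentages from those of the heat maps.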
Figure 22: Number of critical or high vulnerabilities per percentage of repositories with at least that number of vulnerabilities. Percentages are computed over the total number of repositories of each dataset.

4.6.2 Viewpoint of packages

The vulnerabilities found are now analysed from the viewpoint of the packages.

Figure 23: Distribution of vulnerabilities found among the various package managers.

Figure 23 shows the distribution of vulnerabilities found among the various package managers. It can be seen that most of the vulnerabilities come from pypi packages; npm follows, and then maven and gem. Less than 1% of the vulnerabilities come from cargo or golang.

However, Figure 23 does not take into account the frequency of use of each package manager. Table 7 shows the mean and variance of the number of vulnerabilities per package for each package manager. It can be seen that, although more npm vulnerabilities were found than maven and gem ones, on
average a maven or a gem package is more vulnerable than an npm package. It can also be seen that the variance values of the distributions for pypi and maven are the highest ones; this indicates that there are a few packages with many vulnerabilities.

In particular, the variance for pypi is greater than 90. This is explained by the presence of 3 (package name, package version) pypi pairs that are related to the open source platform TensorFlow:

• (tensorflow, 1.14.0)
• (tensorflow, 2.4.0)
• (tensorflow-gpu, 1.14.0)

These packages provide Python APIs for using the TensorFlow platform, and all the TensorFlow vulnerabilities (more than 300) are associated with each of them. If these packages were removed from the analysis, the mean and the variance of the number of vulnerabilities per pypi package would become 0.23 and 2.54 respectively.

                   Vulnerable pairs    Vulnerabilities per pair
Package manager    %                   Average    Variance
pypi                7.35               0.51       92.86
npm                 3.23               0.06        0.18
maven               8.23               0.31        6.28
gem                11.44               0.35        2.66

Table 7: Percentage of (package name, package version) pairs found to be vulnerable (percentages are computed over the total number of packages found for that package manager). Average and variance of the number of vulnerabilities per pair.

In addition, Table 7 gives the percentage of (package name, package version) pairs with at least one vulnerability for each package manager. The percentages are calculated over the total number of packages found for each package manager.

It can be seen that the percentages just reported for npm and maven are similar to the data reported by Sonatype: according to their 2021 SSSC Report [7], 8.4% of the (package, version) pairs hosted in Maven Central and 2.2% of those hosted in npm contain at least one known vulnerability.
However, the SSSC Report states that only 0.5% of the (package, version) pairs hosted in the pypi repository contain at least one vulnerability: therefore, the pypi packages used by the 4 datasets are, on average, much more vulnerable than the packages hosted in the pypi repository as a whole.

Total     Severity >= Low    >= Medium        >= High          Critical
62 710    3 073 (4.90%)      3 004 (4.79%)    2 104 (3.35%)    637 (1.02%)

Table 8: Total number of (package name, package version) pairs found and number of pairs found that have at least one vulnerability with severity at least low, medium, high or critical.
Table 8 shows the number of (package name, package version) pairs found with at least one vulnerability with severity at least low, medium, high or critical.

Vulnerability ID       Severity    Package              Pairs    Repos.
CVE-2022-3517          High        minimatch (npm)          6      443
CVE-2021-43809         High        bundler (gem)           44      392
CVE-2021-44906         Critical    minimist (npm)          10      374
GHSA-2qc6-mcvw-92cw    Medium      nokogiri (gem)          51      351
CVE-2020-28469         High        glob-parent (npm)        5      303
CVE-2020-36327         High        bundler (gem)           21      276
CVE-2021-3918          Critical    json-schema (npm)        1      263
CVE-2022-30122         Medium      rack (gem)              33      263
CVE-2022-30123         High        rack (gem)              33      263
CVE-2022-29181         High        nokogiri (gem)          48      260

Table 9: Top 10 vulnerabilities by number of potentially vulnerable repositories. The table shows the severity of the vulnerability, the vulnerable package, the number of (package name, package version) pairs affected and the number of potentially vulnerable repositories.

Table 9 shows the top 10 vulnerabilities by number of potentially vulnerable repositories. The repositories considered are those of all 4 datasets IT, DE, UK and US. Each vulnerability is associated with one or more versions of a single package. It can be seen that most of the vulnerabilities have high severity and are related to JavaScript or Ruby packages. Moreover, each of the top 5 vulnerabilities in the table makes more than 300 repositories potentially vulnerable.

In addition, the following data are reported:

• Each of the top 80 vulnerabilities by number of potentially vulnerable repositories makes more than 100 repositories potentially vulnerable. Moreover, all of them come from JavaScript or Ruby packages.
• Together, vulnerabilities CVE-2022-3517, CVE-2021-43809 and GHSA-2qc6-mcvw-92cw (first, second and fourth position in Table 9 respectively) make 880 out of 4551 repositories (almost 20%) potentially vulnerable.

Figure 24 shows the percentages of pairs with a number of high and critical vulnerabilities in the indicated ranges.
The percentages are calculated over the number of pairs with at least one vulnerability of severity high or higher.

It can be seen that almost 50% of the vulnerable pairs have only one vulnerability with high severity, and just over 16% have only one vulnerability with critical severity. Moreover, only 6.5% of the vulnerable pairs taken into account have more than 1 critical vulnerability, while 28.8% have more than 1 high vulnerability.

The results in Figure 24 change slightly when only unresolved vulnerabilities are taken into account: 67.44% of packages with at least one unresolved high or critical vulnerability have exactly one unresolved critical vulnerability. Among the others, 14.05% have exactly one unresolved high vulnerability.
Figure 24: Heat maps of the number of high and critical vulnerabilities. The value in each square is the percentage of (package name, package version) pairs that have a number of high and critical vulnerabilities in those ranges. Percentage values are computed only over the (package name, package version) pairs with at least one high or critical vulnerability (see Table 8).
5 Discussion and conclusions

In this work, an attempt was made to use an SBoM building tool to systematically create SBoMs for a set of GitHub repositories. These SBoMs were then used to obtain information on supply chain dependencies and to perform vulnerability analyses.

A special focus was put on the data used: 4 datasets containing repositories of the Italian, German, British and American public administrations were constructed.

A rigorous process was defined in order to obtain the dependencies and vulnerabilities associated with a repository and then to investigate the differences between the 4 datasets.

The data of the 4 datasets come from a set of GitHub organizations related to the public administration of different countries of the world. Among these organizations, it was seen that the United States and the United Kingdom have the highest number of organizations and repositories. It was also noted that among the 16 countries with the largest number of organizations, the most popular languages are Python and JavaScript.

For the US dataset, the most popular type of repository is the one that does not contain code. For the other datasets, IT, DE and UK, the most popular repositories are Python, JavaScript and Ruby ones respectively.

During the process of creating the SBoMs, some critical issues emerged: the tool for generating SBoMs does not allow for full management of conditional dependencies, behaving differently depending on the ecosystem. In addition, non-stringent version constraints of a package are resolved by using the latest version released at the time of SBoM creation, without a standardised procedure for specifying the date of the last update of the software artifact.

Moreover, a problem was encountered in obtaining vulnerabilities for Java packages, due to a lack of full compatibility between the SBoM creation tool and the tool used for vulnerability analysis.
All the dependencies found for the four datasets were then analysed; it was seen that, on average, the DE dataset has the highest number of dependencies per repository, while the US dataset has the lowest. It was noted that for all datasets, the largest number of dependencies came from the JavaScript package manager npm.

The 30 packages on which the highest number of repositories depend were taken into account, and it was seen that most of these are JavaScript packages for all 4 datasets. It was also observed that many JavaScript dependencies come from non-JavaScript repositories.

The vulnerabilities collected were then analysed. It was seen that the UK dataset, with almost 40%, is the one with the highest percentage of repositories
associated with at least one high or critical vulnerability.

The IT dataset is the one with the highest percentage of repositories associated with a large number of high and critical vulnerabilities. In fact, 22.68% of the repositories with at least one high or critical vulnerability have more than 44 high vulnerabilities and more than 11 critical vulnerabilities.

Among all the vulnerabilities, the focus was then put on the subset of unresolved ones: an unresolved vulnerability v is defined as a vulnerability for which no version of the vulnerable package without v has been released. Although only 14.46% of all the vulnerabilities found are unresolved, the percentages of repositories with at least one unresolved critical or high vulnerability are only slightly lower than the percentages of repositories with at least one critical or high vulnerability.

Taking into account vulnerabilities from the viewpoint of the packages, it was seen that the package manager with the highest percentage of vulnerable packages is the Ruby one. In fact, 11.44% of the (package name, package version) pairs from RubyGems have at least one vulnerability (see Table 7). Furthermore, although only 4.90% of the (package name, package version) pairs were found to be vulnerable, it is observed that only 3 packages made almost 20% of all repositories potentially vulnerable.

In conclusion, the use of SBoM standards can be an effective way of systematically keeping track of the elements of a software artifact's supply chain; the presence of rigorous standards makes it possible to analyse the dependencies and their vulnerabilities in a programmatic and continuous way.

However, the SBoM creation tool taken into consideration still presents critical issues, and there is not complete compatibility between the creation and analysis tools used.
In detail, it is not yet possible to define a procedure for the creation and analysis of an SBoM that applies to a GitHub repository and that is:

• Accurate: packages with non-fixed versions are inserted into the SBoM with the latest version existing at the time the SBoM is built, not at the time of the latest project update. Moreover, the conditions of conditional dependencies are ignored.
• Consistent across different ecosystems: conditional dependencies are handled with a predefined behaviour that is different for each ecosystem.
• Complete: starting from the SBoM, vulnerabilities associated with Maven packages are not found.

Finally, in the next paragraphs, open problems and possible future works are discussed.

A limitation of a vulnerability analysis such as the one carried out in this work is that the vulnerabilities found are potential vulnerabilities for repositories. A vulnerability in a package used by one of the analysed repositories
is a potential vulnerability for that repository. A potential vulnerability actually becomes a vulnerability when it is exploitable in the target repository; whether the vulnerability is exploitable depends on the parts of the package used and how they are used by the repository itself. Determining whether or not a third-party vulnerability affects a repository, and therefore its actual impact, remains an open question.

Furthermore, it is important to understand whether the maintainers of a repository are aware of potential vulnerabilities in their repository. GitHub provides the Dependabot [38] tool, which generates alerts when it identifies a project dependency with a vulnerability. To the best of the author's knowledge, there appear to be no studies on the behaviour of maintainers, i.e. whether or not they analyse the presence of vulnerabilities and their potential impact.

In addition, there seems to be no convention on how to indicate whether a vulnerability of a certain third-party dependency has been analysed. Understanding whether the repository is actually vulnerable and providing this information in a standardized way would be useful for developers who decide to use the repository. Moreover, having this information would make vulnerability analysis more precise, allowing analysis to be limited to only those vulnerabilities that can be exploited from the code contained in the repository.

An analysis that could be the subject of future studies concerns the life cycle of a vulnerability with respect to the repository. Understanding whether a potential vulnerability was already known at the time of the commit that introduced it may be important. Such an analysis would provide a better understanding of the behaviour of maintainers: do they take potential vulnerabilities into account when using third-party packages in their open source projects?
Another line of future work could concern differences in the number of vulnerabilities across repositories. It might be interesting to understand whether these differences are due to the number of third-party packages used, the type of these packages or specific actions of the maintainers. Repositories that are less vulnerable than others, for instance, could indicate maintainers who analyse the presence of potential vulnerabilities and remove vulnerable dependencies.
  58. 58. 6 Appendix
This appendix describes the software developed for data collection and database construction. Each functional unit is described in its own subsection.
The data collected by the functional units are stored in a SQLite database with the structure described in subsection 4.2, using Python's sqlite3 module. Whenever a record is inserted into a table and a record with the same primary key already exists, the insertion is ignored.
Note that the functional units described in this appendix do not cover all of the code that was written. In particular, the scripts that store data in the database and the scripts that invoke the functional units have been omitted. The scripts that collect vulnerability metadata have also been omitted, since the data obtained from them are not discussed in this work.
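One way to obtain the "ignore the insertion if the primary key already exists" behaviour with the sqlite3 module is SQLite's INSERT OR IGNORE statement. The sketch below is an assumption about how this could be done, not the code of this work; the table name, the column names and the helper insert_organization are hypothetical and do not reflect the schema of subsection 4.2.

```python
import sqlite3

# In-memory database with a hypothetical table (not the schema of subsection 4.2)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE organization (username TEXT PRIMARY KEY, section TEXT)")


def insert_organization(conn, username, section):
    """Insert a row; INSERT OR IGNORE skips it if the primary key already exists."""
    conn.execute(
        "INSERT OR IGNORE INTO organization (username, section) VALUES (?, ?)",
        (username, section),
    )


insert_organization(conn, "example-org", "Section A")
insert_organization(conn, "example-org", "Section B")  # same key: silently ignored
conn.commit()

rows = conn.execute("SELECT username, section FROM organization").fetchall()
```

With this statement the first record stored for a key is kept and later duplicates are dropped without raising an error, so the collection scripts can be re-run safely.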
  59. 59. 6.1 GitHub & Government list
Type: Script.
Language: Python.
Tools and external libraries used: BeautifulSoup [39], requests [40].
Input: GitHub & Government list web page URL.
Output: Python list of (username, section, category) tuples, one for each organization in the GitHub & Government list.
Description: The Python script uses the BeautifulSoup library to scrape the GitHub & Government page and obtain the data described in subsection 3.1. The web page is fetched with the requests library; the username and the subsection name are extracted from the HTML content with two regular expressions. Below is a snippet with the core of the code:

    import re

    import requests
    from bs4 import BeautifulSoup

    URL = 'https://government.github.com/community/'  # GitHub and Government URL
    resp = requests.get(URL)                 # Get the web page
    soup = BeautifulSoup(resp.text, 'lxml')  # BeautifulSoup initialization
    orgs_names = soup.select('div.org-name')
    orgs = list()
    for on in orgs_names:
        # '@([\w,-]+)': regex for the GitHub username
        username = re.compile(r'@([\w,-]+)').search(on.text).groups()[0]
        section = on.find_previous('h2').text.strip()
        subsec = on.find_previous('h3').text
        # '(.*)\([0-9]+\)': regex for the subsection name
        subsec = re.compile(r'(.*)\([0-9]+\)').search(subsec).groups()[0].strip()
        orgs.append((username, section, subsec))
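The two regular expressions can be tried in isolation, without fetching the page. The sketch below assumes that the patterns are `@([\w,-]+)` and `(.*)\([0-9]+\)` (the backslashes are an assumption) and that a subsection heading ends with a count in parentheses. The sample strings and the helper parse_org are hypothetical and are not taken from the real page.

```python
import re

# Patterns as assumed above
USERNAME_RE = re.compile(r'@([\w,-]+)')
SUBSECTION_RE = re.compile(r'(.*)\([0-9]+\)')


def parse_org(org_text, subsection_heading):
    """Extract the GitHub username and the subsection name from two text fragments."""
    username = USERNAME_RE.search(org_text).groups()[0]
    subsec = SUBSECTION_RE.search(subsection_heading).groups()[0].strip()
    return username, subsec


# Hypothetical inputs mimicking an organization entry and an h3 heading
result = parse_org("Example Org @example-org", "Example Subsection (12)")
```

Note that search() returns None when a fragment does not match, so a production script would need to check the match before calling groups().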
