Leveraging Bagging for Evolving Data Streams
1. Leveraging Bagging for Evolving Data Streams
Albert Bifet, Geoff Holmes, and Bernhard Pfahringer
University of Waikato
Hamilton, New Zealand
Barcelona, 21 September 2010
ECML PKDD 2010
2. Mining Data Streams with Concept Drift
Extract information from
potentially infinite sequence of data
possibly varying over time
using few resources
Adaptively:
no prior knowledge of type or rate of change
3. Mining Data Streams with Concept Drift
Extract information from
potentially infinite sequence of data
possibly varying over time
using few resources
Leveraging Bagging
New improvements for adaptive bagging methods using
input randomization
output randomization
4. Outline
1 Data stream constraints
2 Leveraging Bagging for Evolving Data Streams
3 Empirical evaluation
5. Outline
1 Data stream constraints
2 Leveraging Bagging for Evolving Data Streams
3 Empirical evaluation
6. Mining Massive Data
Eric Schmidt, August 2010
Every two days now we create as much information as we did
from the dawn of civilization up until 2003.
5 exabytes of data
7. Data stream classification cycle
1 Process an example at a time, and inspect it only once (at most)
2 Use a limited amount of memory
3 Work in a limited amount of time
4 Be ready to predict at any point
8. Mining Massive Data
Koichi Kawana
Simplicity means the achievement of maximum effect with
minimum means.
Figure: the data-stream trade-off between time, accuracy, and memory.
9. Evaluation Example
              Accuracy  Time  Memory
Classifier A       70%   100      20
Classifier B       80%    20      40
Which classifier is performing better?
11. Evaluation Example
              Accuracy  Time  Memory  RAM-Hours
Classifier A       70%   100      20      2,000
Classifier B       80%    20      40        800
Which classifier is performing better?
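To make the cost measure concrete: a RAM-Hour charges one unit for every GB of RAM deployed for one hour, so the cost of a classifier is simply memory × time. A minimal sketch, assuming the slide's Time and Memory columns are in hours and GB (the units on the slide are illustrative):

```python
def ram_hours(memory_gb: float, time_hours: float) -> float:
    # One RAM-Hour = one GB of RAM deployed for one hour.
    return memory_gb * time_hours

# The slide's numbers: Classifier B is cheaper despite its higher accuracy.
print(ram_hours(20, 100))  # Classifier A -> 2000.0
print(ram_hours(40, 20))   # Classifier B -> 800.0
```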
12. Outline
1 Data stream constraints
2 Leveraging Bagging for Evolving Data Streams
3 Empirical evaluation
13. Hoeffding Trees
Hoeffding Tree : VFDT
Pedro Domingos and Geoff Hulten.
Mining high-speed data streams. 2000
With high probability, constructs a model identical to the one a
traditional (greedy) method would learn
With theoretical guarantees on the error rate
Figure: example Hoeffding tree — root test Contains "Money": Yes → YES; No → test Time: Day → YES, Night → NO.
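The quantity behind the split decision is the Hoeffding bound: after n observations of a variable with range R, the true mean is within ε = sqrt(R² ln(1/δ) / (2n)) of the observed mean with probability 1 − δ. A minimal sketch of the split test (a simplification of VFDT, which additionally breaks near-ties with a threshold τ):

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    # epsilon = sqrt(R^2 ln(1/delta) / (2n)): with probability 1 - delta,
    # the true mean after n observations is within epsilon of the sample mean.
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    # Split once the observed gap between the two best attributes is large
    # enough that the best one is truly best with probability 1 - delta.
    return best_gain - second_gain > hoeffding_bound(value_range, delta, n)
```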
14. Hoeffding Naive Bayes Tree
Hoeffding Tree
Majority Class learner at leaves
Hoeffding Naive Bayes Tree
G. Holmes, R. Kirkby, and B. Pfahringer.
Stress-testing Hoeffding trees, 2005.
monitors accuracy of a Majority Class learner
monitors accuracy of a Naive Bayes learner
predicts using the most accurate method
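A sketch of this leaf-level logic, assuming hypothetical `mc` (majority class) and `nb` (naive Bayes) predictor objects with `learn`/`predict` methods: each leaf scores both predictors prequentially (predict first, then train) and answers with whichever has been more accurate so far.

```python
class AdaptiveLeaf:
    def __init__(self, mc, nb):
        self.mc, self.nb = mc, nb   # majority-class and naive Bayes predictors
        self.mc_correct = 0         # running accuracy monitors
        self.nb_correct = 0

    def learn(self, x, y):
        # Test-then-train: credit each predictor before updating it.
        self.mc_correct += int(self.mc.predict(x) == y)
        self.nb_correct += int(self.nb.predict(x) == y)
        self.mc.learn(x, y)
        self.nb.learn(x, y)

    def predict(self, x):
        # Answer with whichever method has been more accurate at this leaf.
        best = self.mc if self.mc_correct >= self.nb_correct else self.nb
        return best.predict(x)
```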
16. Bagging
Figure: Poisson(1) Distribution.
Each base model's training set contains each of the original training
examples K times, where P(K = k) follows a binomial distribution.
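The connection between the two distributions named on this slide: in a bootstrap sample of size n, K ~ Binomial(n, 1/n), which converges to Poisson(1) as n grows. A quick numeric check:

```python
import math

n = 1_000_000  # training-set size
for k in range(5):
    binom = math.comb(n, k) * (1 / n) ** k * (1 - 1 / n) ** (n - k)
    pois = math.exp(-1) / math.factorial(k)
    print(f"P(K={k}): binomial {binom:.6f}, Poisson(1) {pois:.6f}")
# Both print ~0.3679, 0.3679, 0.1839, 0.0613, 0.0153: the distributions agree.
```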
17. Oza and Russell’s Online Bagging for M models
1: Initialize base models hm for all m ∈ {1,2,...,M}
2: for all training examples do
3: for m = 1,2,...,M do
4: Set w = Poisson(1)
5: Update hm with the current example with weight w
6: anytime output:
7: return hypothesis: h_fin(x) = argmax_{y ∈ Y} Σ_{t=1..T} I(h_t(x) = y)
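A minimal sketch of the update and voting steps, assuming base models expose hypothetical `learn(x, y, weight)` and `predict(x)` methods:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

def online_bagging_update(models, x, y):
    # Each model trains on the example K ~ Poisson(1) times, simulating
    # the bootstrap resampling of batch bagging in a single pass.
    for h in models:
        w = int(rng.poisson(1.0))
        if w > 0:
            h.learn(x, y, weight=w)

def bagging_predict(models, x):
    # h_fin(x) = argmax_y sum_t I(h_t(x) = y): plain majority vote.
    return Counter(h.predict(x) for h in models).most_common(1)[0][0]
```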
18. ADWIN Bagging (KDD’09)
ADWIN
An adaptive sliding window whose size is recomputed online
according to the rate of change observed.
ADWIN has rigorous guarantees (theorems)
On ratio of false positives and negatives
On the relation of the size of the current window and
change rates
ADWIN Bagging
When a change is detected, the worst classifier is removed and
a new classifier is added.
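A didactic sketch of ADWIN's cut test, fed with a classifier's 0/1 error stream. This version keeps the raw window and pays O(W) per insertion; the real algorithm obtains the same guarantees with exponential bucketing in logarithmic time and memory. The δ default is an assumption.

```python
import math
from collections import deque

class SimpleAdwin:
    def __init__(self, delta: float = 0.002):
        self.delta = delta
        self.window = deque()

    def add(self, x: float) -> bool:
        """Insert a value; shrink the window and return True on change."""
        self.window.append(x)
        changed = False
        while self._has_cut():
            self.window.popleft()   # drop stale data from the old concept
            changed = True
        return changed

    def _has_cut(self) -> bool:
        w, n = list(self.window), len(self.window)
        total, head = sum(w), 0.0
        for i in range(1, n):       # test every split point of the window
            head += w[i - 1]
            n0, n1 = i, n - i
            mu0, mu1 = head / n0, (total - head) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)              # harmonic mean of sizes
            eps = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            if abs(mu0 - mu1) >= eps:   # subwindow means differ significantly
                return True
        return False
```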
19. ADWIN Bagging for M models
1: Initialize base models hm for all m ∈ {1,2,...,M}
2: for all training examples do
3: for m = 1,2,...,M do
4: Set w = Poisson(1)
5: Update hm with the current example with weight w
6: if ADWIN detects change in error of one of the
classifiers then
7: Replace classifier with higher error with a new one
8: anytime output:
9: return hypothesis: h_fin(x) = argmax_{y ∈ Y} Σ_{t=1..T} I(h_t(x) = y)
20. Leveraging Bagging for Evolving
Data Streams
Randomization as a powerful tool to increase accuracy and
diversity
There are three ways of using randomization:
Manipulating the input data
Manipulating the classifier algorithms
Manipulating the output targets
22. ECOC Output Randomization
Table: Example matrix of random output codes for 3 classes and 6
classifiers
             Class 1  Class 2  Class 3
Classifier 1       0        0        1
Classifier 2       0        1        1
Classifier 3       1        0        0
Classifier 4       1        1        0
Classifier 5       1        0        1
Classifier 6       0        1        0
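A sketch of how such a matrix is drawn and decoded, with hypothetical binary base models: classifier m is trained on the meta-label µ_m(y) instead of y, and prediction picks the class whose code row agrees with the most binary votes.

```python
import random

def random_output_codes(n_classes: int, n_classifiers: int, seed: int = 1):
    # mu[m][y] in {0, 1}: the binary meta-label classifier m assigns to class y.
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n_classes)]
            for _ in range(n_classifiers)]

def ecoc_predict(models, mu, x, n_classes: int):
    # argmax_y sum_t I(h_t(x) = mu_t(y)): the class most consistent
    # with the binary predictions under the code matrix.
    bits = [h.predict(x) for h in models]
    return max(range(n_classes),
               key=lambda y: sum(b == mu[m][y] for m, b in enumerate(bits)))
```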
23. Leveraging Bagging for Evolving Data Streams
Leveraging Bagging
Using Poisson(λ)
Leveraging Bagging MC
Using Poisson(λ) and Random Output Codes
Fast Leveraging Bagging ME
if an instance is misclassified: weight = 1
if not: weight = e_T / (1 − e_T)
24. Input Randomization
Bagging
resampling with replacement using Poisson(1)
Other Strategies (sketched below)
subagging: resampling without replacement
half subagging: resampling without replacement, using half of the instances
bagging without taking out any instance (WT): using 1 + Poisson(1)
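The strategies above differ only in how many copies of the current example each base model trains on. A minimal sketch; the inclusion probability p for the subagging variants is an assumption (p = 1/2 corresponds to half subagging):

```python
import numpy as np

rng = np.random.default_rng(1)

def example_weight(strategy: str, p: float = 0.5) -> int:
    if strategy == "bagging":      # resampling with replacement: Poisson(1)
        return int(rng.poisson(1.0))
    if strategy == "wt":           # without taking out any instance
        return 1 + int(rng.poisson(1.0))
    if strategy == "subagging":    # without replacement: each example at most once
        return int(rng.random() < p)
    raise ValueError(strategy)
```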
25. Leveraging Bagging for Evolving Data Streams
1: Initialize base models hm for all m ∈ {1,2,...,M}
2:
3: for all training examples (x,y) do
4: for m = 1,2,...,M do
5: Set w = Poisson(λ)
6: Update hm with the current example with weight w
7: if ADWIN detects change in error of one of the classifiers
then
8: Replace classifier with higher error with a new one
9: anytime output:
10: return h_fin(x) = argmax_{y ∈ Y} Σ_{t=1..T} I(h_t(x) = y)
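A sketch of the full loop, reusing the SimpleAdwin detector and the hypothetical model interface from the earlier sketches; λ = 6 (MOA's default for Leveraging Bagging) is an assumption here, and `make_model()` is a hypothetical factory for a base learner such as a Hoeffding tree. Prediction is the same majority vote as in online bagging.

```python
import numpy as np

class LeveragingBagging:
    def __init__(self, make_model, n_models=10, lam=6.0, seed=1):
        self.make_model, self.lam = make_model, lam
        self.rng = np.random.default_rng(seed)
        self.models = [make_model() for _ in range(n_models)]
        self.detectors = [SimpleAdwin() for _ in range(n_models)]

    def learn(self, x, y):
        # Prequential order: feed each model's 0/1 error to its ADWIN first.
        changes = [det.add(int(h.predict(x) != y))
                   for h, det in zip(self.models, self.detectors)]
        # Then train every model on a Poisson(lambda)-weighted copy.
        for h in self.models:
            w = int(self.rng.poisson(self.lam))
            if w > 0:
                h.learn(x, y, weight=w)
        # On any detected change, replace the classifier with highest error.
        if any(changes):
            def err(d):
                return sum(d.window) / max(1, len(d.window))
            worst = max(range(len(self.models)),
                        key=lambda m: err(self.detectors[m]))
            self.models[worst] = self.make_model()
            self.detectors[worst] = SimpleAdwin()
```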
26. Leveraging Bagging for Evolving Data Streams MC
1: Initialize base models hm for all m ∈ {1,2,...,M}
2: Compute coloring µm(y)
3: for all training examples (x,y) do
4: for m = 1,2,...,M do
5: Set w = Poisson(λ)
6: Update hm with the current example with weight w and
class µm(y)
7: if ADWIN detects change in error of one of the classifiers
then
8: Replace classifier with higher error with a new one
9: anytime output:
10: return h_fin(x) = argmax_{y ∈ Y} Σ_{t=1..T} I(h_t(x) = µ_t(y))
27. Leveraging Bagging for Evolving Data Streams ME
1: Initialize base models hm for all m ∈ {1,2,...,M}
2:
3: for all training examples (x,y) do
4: for m = 1,2,...,M do
5: Set w = 1 if misclassified, otherwise e_T / (1 − e_T)
6: Update hm with the current example with weight w
7: if ADWIN detects change in error of one of the classifiers
then
8: Replace classifier with higher error with a new one
9: anytime output:
10: return h_fin(x) = argmax_{y ∈ Y} Σ_{t=1..T} I(h_t(x) = y)
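The ME variant replaces Poisson sampling with a deterministic weight, which is why it turns out to be the cheapest variant in the evaluation that follows. A one-line sketch, where e_T is the running error estimate (assumed strictly between 0 and 1):

```python
def me_weight(misclassified: bool, e_t: float) -> float:
    # Weight 1 on a mistake, otherwise e_T / (1 - e_T).
    return 1.0 if misclassified else e_t / (1.0 - e_t)
```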
28. Outline
1 Data stream constraints
2 Leveraging Bagging for Evolving Data Streams
3 Empirical evaluation
29. What is MOA?
{M}assive {O}nline {A}nalysis is a framework for mining data
streams.
Based on experience with Weka and VFML
Focussed on classification trees, but lots of active
development: clustering, item set and sequence mining,
regression
Easy to extend
Easy to design and run experiments
30. MOA: the bird
The Moa (another native NZ bird) is not only flightless, like the
Weka, but also extinct.
31. Leveraging Bagging Empirical evaluation
Figure: Accuracy (%) versus number of instances on dataset SEA with three concept drifts, comparing Leveraging Bagging, Leveraging Bagging MC, ADWIN Bagging, and Online Bagging.
32. Empirical evaluation
                      Accuracy  RAM-Hours
Hoeffding Tree          74.03%       0.01
Online Bagging          77.15%       2.98
ADWIN Bagging           79.24%       1.48
ADWIN Half Subagging    78.36%       1.04
ADWIN Subagging         78.68%       1.13
ADWIN Bagging WT        81.49%       2.74
ADWIN Bagging Strategies
half subagging: resampling without replacement, using half of the instances
subagging: resampling without replacement
WT: bagging without taking out any instance, using 1 + Poisson(1)
33. Empirical evaluation
                       Accuracy  RAM-Hours
Hoeffding Tree           74.03%       0.01
Online Bagging           77.15%       2.98
ADWIN Bagging            79.24%       1.48
Leveraging Bagging       85.54%      20.17
Leveraging Bagging MC    85.37%      22.04
Leveraging Bagging ME    80.77%       0.87
Leveraging Bagging variants
Leveraging Bagging: using Poisson(λ)
Leveraging Bagging MC: using Poisson(λ) and Random Output Codes
Leveraging Bagging ME: using weight 1 if misclassified, otherwise e_T / (1 − e_T)
34. Empirical evaluation
                                  Accuracy  RAM-Hours
Hoeffding Tree                      74.03%       0.01
Online Bagging                      77.15%       2.98
ADWIN Bagging                       79.24%       1.48
Random Forest Leveraging Bagging    80.69%       5.51
Random Forest Online Bagging        72.91%       1.30
Random Forest ADWIN Bagging         74.24%       0.89
Random Forests
the input training set is obtained by sampling with replacement
the nodes of the tree use only √n random attributes to split
we only keep statistics of these attributes
35. Leveraging Bagging Diversity
Figure: Kappa-Error diagrams (error versus kappa statistic) for Leveraging Bagging and Online Bagging on the SEA data with three concept drifts, plotting 576 pairs of classifiers.
36. Summary
http://moa.cs.waikato.ac.nz/
Conclusions
New improvements for bagging methods using input
randomization
Improving Accuracy: Using Poisson(λ)
Improving RAM-Hours: using weight 1 if misclassified, otherwise e_T / (1 − e_T)
New improvements for bagging methods using output
randomization
No need for multi-class classifiers