Using ActiveMQ at CERN for the Large Hadron Collider
James Casey, CERN IT Department, Grid Technologies Group
FUSE Community Day, London, 2010

Overview
- What we do at CERN
- Current ActiveMQ usage
  - Monitoring a distributed infrastructure
  - Lessons learned
- Future ActiveMQ usage
  - Building a generic messaging service

LHC is a very large scientific instrument…
- Large Hadron Collider: 27 km circumference, next to Lake Geneva
- Experiments: CMS, LHCb, ALICE, ATLAS

… based on advanced technology
- 27 km of superconducting magnets cooled in superfluid helium at 1.9 K

What are we looking for?
- To answer fundamental questions about the construction of the universe
  - Why have we got mass? (Higgs boson)
  - Search for a Grand Unified Theory
  - Supersymmetry
  - Dark matter, dark energy
  - Antimatter/matter asymmetry

This requires…
1. Accelerators: powerful machines that accelerate particles to extremely high energies and then bring them into collision with other particles
2. Detectors: gigantic instruments that record the resulting particles as they “stream” out from the point of collision
3. Computers: to collect, store, distribute and analyse the vast amount of data produced by the detectors
4. People: only a worldwide collaboration of thousands of scientists, engineers, technicians and support staff can design, build and operate the complex “machines”

View of the ATLAS detector during construction
- Length: ~46 m
- Radius: ~12 m
- Weight: ~7000 tons
- ~10^8 electronic channels

A collision at LHC
- Bunches, each containing 100 billion protons, cross 40 million times a second in the centre of each experiment
- 1 billion proton-proton interactions per second in ATLAS and CMS!
- Large numbers of collisions per event
  - ~1000 tracks stream into the detector every 25 ns
  - A large number of channels (~100 M channels) → ~1 MB per 25 ns, i.e. 40 TB/s!
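
The 40 TB/s figure follows directly from the two numbers on the slide. A quick check in Python, assuming decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes); the variable names are illustrative only:

```python
# Back-of-envelope check of the raw detector data rate quoted on the slide.
BUNCH_SPACING_S = 25e-9       # one bunch crossing every 25 ns
BYTES_PER_CROSSING = 1e6      # ~1 MB read out per crossing (decimal megabyte)

crossings_per_second = 1 / BUNCH_SPACING_S             # about 40 million per second
raw_rate_bytes = BYTES_PER_CROSSING * crossings_per_second
raw_rate_tb = raw_rate_bytes / 1e12                     # decimal terabytes per second

print(f"{crossings_per_second:.0f} crossings/s, {raw_rate_tb:.0f} TB/s")
```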

The Data Acquisition
- Cannot possibly extract and record 40 TB/s
- Essentially 2 stages of selection
  - Dedicated custom-designed hardware processors: 40 MHz → 100 kHz
  - Then each ‘event’ is sent to a free core in a farm of ~30k CPU cores: 100 kHz → few 100 Hz
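
To get a feel for what the two selection stages imply, the rejection factors can be worked out from the rates on the slide. This is only a sketch: the slide says "few 100 Hz", so the 200 Hz output rate below is an assumed placeholder, and the per-event time budget assumes each core works on one event at a time.

```python
# Rejection factors of the two selection stages, from the rates on the slide.
HARDWARE_INPUT_HZ = 40e6      # bunch-crossing rate
HARDWARE_OUTPUT_HZ = 100e3    # rate accepted by the hardware stage
FARM_OUTPUT_HZ = 200.0        # ASSUMPTION: stands in for "few 100 Hz"
FARM_CORES = 30_000           # ~30k CPU cores in the farm

hardware_rejection = HARDWARE_INPUT_HZ / HARDWARE_OUTPUT_HZ   # 1 event kept in 400
farm_rejection = HARDWARE_OUTPUT_HZ / FARM_OUTPUT_HZ          # 1 event kept in 500
overall_rejection = hardware_rejection * farm_rejection

# If every core handles one event at a time, each core sees a new event
# every FARM_CORES / HARDWARE_OUTPUT_HZ seconds on average.
time_budget_s = FARM_CORES / HARDWARE_OUTPUT_HZ

print(f"hardware stage keeps 1 in {hardware_rejection:.0f}")
print(f"farm stage keeps 1 in {farm_rejection:.0f}")
print(f"overall 1 in {overall_rejection:.0f}, ~{time_budget_s:.1f} s per event per core")
```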

Tier 0 at CERN: Acquisition, First pass processing, Storage & Distribution
Ian.Bird@cern.ch

First Beam day: 10 Sep. 2008

The LHC Computing Challenge
- Experiments will produce about 15 million gigabytes (15 PB) of data each year (about 20 million CDs!)
- LHC data analysis requires computing power equivalent to ~100,000 of today's fastest PC processors (140 MSi2K)
- Analysis carried out at more than 140 computing centres
  - 12 large centres for primary data management: CERN (Tier-0) and eleven Tier-1s
  - 38 federations of smaller Tier-2 centres
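
The "20 million CDs" comparison and the average data rate it implies can be checked in a few lines. The 700 MB CD capacity is an assumption (the slide does not state one), and decimal units are used throughout.

```python
# Sanity check of the yearly data volume comparison.
YEARLY_DATA_BYTES = 15e15        # 15 PB per year
CD_CAPACITY_BYTES = 700e6        # ASSUMPTION: a standard 700 MB CD
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

cds_per_year = YEARLY_DATA_BYTES / CD_CAPACITY_BYTES
average_rate_mb_s = YEARLY_DATA_BYTES / SECONDS_PER_YEAR / 1e6

print(f"about {cds_per_year / 1e6:.1f} million CDs per year")
print(f"about {average_rate_mb_s:.0f} MB/s averaged over the whole year")
```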

Solution: the Grid
- Use the Grid to unite computing resources of particle physics institutions around the world
- The World Wide Web provides seamless access to information that is stored in many millions of different geographical locations
- The Grid is an infrastructure that provides seamless access to computing power and data storage capacity distributed over the globe. It makes multiple computer centres look like a single system to the end-user.

LHC Computing Grid project (WLCG)
- The grid is complex
  - Highly distributed
  - No central control
- Lots of software in many languages
  - Grid middleware: 1.7M SLOC in total (C++ 850K, C 550K, SH 160K, Java 115K, Python 50K, Perl 35K)
  - Experiment code, e.g. ATLAS: 7M SLOC of C++
- Complex service dependencies

My Problem - Monitoring the operational grid infrastructure
- Tools for Operations and Monitoring
  - Build and run monitoring infrastructure for WLCG
  - Operational tools for management of grid infrastructures
- Examples:
  - Configuration database
  - Helpdesk/ticketing
  - Monitoring
  - Availability reporting
- Early design decision: use messaging as an integration framework

Open Source to the core
- Design and develop services for Open Science based on:
  - Open source software
  - Open protocols
- Funded by a series of EU projects
  - EDG, EGEE, EGI.eu, EMI
- Backed by industry support
- All our code is open source and freely available
- Results published in Open Access journals

Use Case: Availability Monitoring and Reporting
- Monitoring of reliability and availability of European distributed computing infrastructure
  - Data must be reliable
  - Definitive source of availability and accounting reports
- Distributed operations model
  - Grid implies ‘cross-administrative domain’
  - No root login!
  - Global ticketing
  - Distributed operations dashboards

Solution
- Distributed monitoring based on Nagios
- Tied together with ActiveMQ
  - Network of 4 brokers in 3 countries
  - Linked to ticketing and alarm systems
- Message-level signing + encryption for verification of identity
- Uses STOMP for all communication
  - Code in Perl and Python
- Topics with Virtual Consumers
  - All persistent messages
  - Topic naming used for filtering and selection
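
A minimal sketch of what "STOMP for all communication" and "topic naming used for filtering" could look like on the wire. It only builds STOMP 1.0 frames as bytes and opens no connection. The `grid.probe.<site>.<service>.<metric>` naming scheme and the site and service names are hypothetical, not the scheme actually used at CERN; the `persistent` header and the `>` wildcard are ActiveMQ conventions.

```python
def build_frame(command, headers, body=""):
    """Serialise a STOMP 1.0 frame: command, headers, blank line, body, NUL."""
    head = "\n".join([command] + [f"{key}:{value}" for key, value in headers.items()])
    return (head + "\n\n" + body).encode("utf-8") + b"\x00"


def probe_topic(site, service, metric):
    """Hypothetical naming scheme: one dot-separated level per filter dimension."""
    return f"/topic/grid.probe.{site}.{service}.{metric}"


# Producer side: publish one persistent test result.
send = build_frame(
    "SEND",
    {"destination": probe_topic("SITE-A", "ce01", "job-submit"), "persistent": "true"},
    '{"status": "OK"}',
)

# Consumer side: a dashboard that only wants results for one site.
# In ActiveMQ destination wildcards, '>' matches all remaining name levels.
subscribe = build_frame(
    "SUBSCRIBE",
    {"destination": "/topic/grid.probe.SITE-A.>", "ack": "client"},
)

print(send)
print(subscribe)
```

Putting the filter dimensions into the destination name lets the broker do the selection, so lightweight clients do not need message selectors.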

Architecture [diagram]

Component drilldown [diagram]

Current Status
- 16 national-level Nagios servers
  - Will grow to ~40 in next 3 months
- Clients distributed across 40 countries
  - 315 sites
  - 5K services
- 500,000 test results/day
- 3 consumers of full data stream to database for analysis and post-processing
- 40 distributed alarm dashboards with filtered feeds
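
The daily totals translate into modest average message rates. A short calculation from the slide's own numbers (averages only; real traffic is presumably burstier):

```python
# Average message rates implied by the "Current Status" numbers.
TEST_RESULTS_PER_DAY = 500_000
SERVICES = 5_000
FULL_STREAM_CONSUMERS = 3
SECONDS_PER_DAY = 24 * 60 * 60

publish_rate = TEST_RESULTS_PER_DAY / SECONDS_PER_DAY           # messages/s into the brokers
full_stream_deliveries = publish_rate * FULL_STREAM_CONSUMERS   # messages/s to the database consumers
results_per_service = TEST_RESULTS_PER_DAY / SERVICES           # results per service per day
seconds_between_results = SECONDS_PER_DAY / results_per_service

print(f"{publish_rate:.1f} msg/s published on average")
print(f"{full_stream_deliveries:.1f} msg/s delivered to full-stream consumers")
print(f"one result per service every {seconds_between_results:.0f} s on average")
```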

Lessons (1)
- Just using STOMP is sub-optimal
- Pros:
  - Very simple
  - Good for lightweight clients in many languages
- Cons:
  - Hard to write reliable long-lived clients
    - No NACK, no heartbeat
  - Ambiguities in the specification
    - Content-length and TextMessage
    - Content-encoding
  - Not really broker-independent in practice
- Interested in contributing to STOMP 1.1/2.0
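
The content-length ambiguity is easiest to see in a frame parser. The sketch below reads a single STOMP 1.0 frame from bytes. The `kind` it returns mirrors the convention in ActiveMQ's STOMP documentation, as far as I know: a frame with a `content-length` header is treated as a bytes message, and one without it as a text message. Other brokers may decide differently, which is part of the point the slide makes.

```python
def parse_frame(raw):
    """Parse one STOMP 1.0 frame; return (command, headers, body, kind)."""
    head, _, rest = raw.partition(b"\n\n")
    lines = head.decode("utf-8").split("\n")
    command = lines[0]
    headers = {}
    for line in lines[1:]:
        key, _, value = line.partition(":")
        headers[key] = value

    if "content-length" in headers:
        # Body length is explicit, so the body may itself contain NUL bytes.
        length = int(headers["content-length"])
        body = rest[:length]
        if rest[length:length + 1] != b"\x00":
            raise ValueError("frame is not NUL-terminated after content-length bytes")
        kind = "bytes"
    else:
        # No length given: the body ends at the first NUL byte.
        body, _, _ = rest.partition(b"\x00")
        kind = "text"
    return command, headers, body, kind


print(parse_frame(b"MESSAGE\ndestination:/topic/x\n\nhello\x00"))
print(parse_frame(b"MESSAGE\ncontent-length:3\n\na\x00b\x00"))
```

A client that always sets `content-length`, which is the safe choice for framing, therefore changes the message type that JMS consumers see.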

Lessons (2)
- JMS durable consumers suck
  - Fragile in Network of Brokers
  - Many problems fixed now by FUSE
- Virtual Topics solve the problem
  - Pros:
    - Just like a queue
    - Can monitor queue length, purge
  - Cons:
    - Issues with selectors
    - Startup race conditions (solvable via config)
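
A toy, in-memory model of the Virtual Topic idea, using ActiveMQ's default naming convention (publish to `VirtualTopic.<name>`, consume from `Consumer.<consumer>.VirtualTopic.<name>`). `ToyBroker` and the topic and consumer names are invented for illustration and have nothing to do with the real broker code. It shows why each consumer "looks just like a queue", and one plausible reading of the startup race: a consumer queue that does not exist yet at publish time receives nothing.

```python
from collections import defaultdict, deque

PREFIX = "VirtualTopic."


def consumer_queue(consumer, topic):
    """Queue name a consumer reads from, following ActiveMQ's default convention."""
    if not topic.startswith(PREFIX):
        raise ValueError(f"not a virtual topic: {topic}")
    return f"Consumer.{consumer}.{topic}"


class ToyBroker:
    """In-memory stand-in for a broker; illustration only."""

    def __init__(self):
        self.bindings = defaultdict(list)   # topic name -> consumer queue names
        self.queues = {}                    # queue name -> pending messages

    def add_consumer(self, consumer, topic):
        name = consumer_queue(consumer, topic)
        if name not in self.queues:
            self.queues[name] = deque()
            self.bindings[topic].append(name)
        return name

    def publish(self, topic, message):
        # Every existing consumer queue gets its own copy of the message.
        for name in self.bindings.get(topic, ()):
            self.queues[name].append(message)


broker = ToyBroker()
db = broker.add_consumer("database", "VirtualTopic.results")
broker.publish("VirtualTopic.results", "result-1")
dash = broker.add_consumer("dashboard", "VirtualTopic.results")   # arrives late
broker.publish("VirtualTopic.results", "result-2")

# Queue depth is an ordinary, monitorable number per consumer.
print(db, len(broker.queues[db]))
print(dash, len(broker.queues[dash]))
```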

Lessons (3)
- Networks of brokers seem attractive
- Pros:
  - It’s all a cloud
  - Clients connect anywhere and it “just works”
- Cons:
  - It’s a very complicated area of code
  - Often you need to “ask the computer”
    - Or a core ActiveMQ developer
- Trade-off between resilience/scaling and complexity

Lessons (4)
- Know the code
  - Most of it is very simple
  - Even for non-Java developers
  - If you keep away from “Java-ish” stuff
    - JTA, XA, Spring
- Plugin architecture is very easy to work with
  - Most things can be implemented by a plugin
  - E.g. monitoring, logging, restricting features, AuthN/AuthZ
- Docs currently don’t explain everything
  - Especially the interactions between plugins/features

Lessons (5)
- Stay in the ballpark
- If it’s not in tests:
  - Think twice about using the feature in that way…
  - Write a test for it!
- Examples
  - SSL and network connectors
  - Network of Brokers with odd topologies
  - STOMP/OpenWire differences in feature support

Nagios for ActiveMQ
- We use Nagios to monitor
  - Brokers
  - Producers/consumers
- Uses jmx4perl to reduce JVM load on Nagios machine
  - Exposes JMX information as JSON
  - Simple Perl interface to write clients
  - Generic Nagios checks
- Looking at how to make more of this available to the community
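
The checks described here are written in Perl on top of jmx4perl. To keep the examples in one language, the sketch below shows the same idea in Python: take a JSON document carrying one JMX attribute and turn it into a Nagios status line. The JSON shape (`{"status": 200, "value": ...}`) is an assumption modelled on what a JMX-to-JSON agent typically returns. The exit codes 0 to 3 and the `label=value;warn;crit` performance data format are standard Nagios plugin conventions.

```python
import json

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_value(response_text, label, warn, crit):
    """Map one numeric JMX attribute (as JSON text) to (exit_code, status_line)."""
    try:
        value = float(json.loads(response_text)["value"])
    except (ValueError, KeyError, TypeError):
        # Unparseable or unexpected response: report UNKNOWN, never OK.
        return UNKNOWN, f"UNKNOWN - could not read {label} from agent response"

    if value >= crit:
        code = CRITICAL
    elif value >= warn:
        code = WARNING
    else:
        code = OK
    line = f"{STATUS_NAMES[code]} - {label} is {value:g} | {label}={value:g};{warn};{crit}"
    return code, line


# Hypothetical agent response for the broker's store usage percentage.
code, line = check_value('{"status": 200, "value": 85}', "store_percent", 80, 90)
print(code, line)
```

A real plugin would print `line` and call `sys.exit(code)`; that is left out here so the snippet can be imported and tested.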

Broker Monitoring
- Standard OS information
  - Filesystem full, processes running, socket counts, open file counts
- JMX for broker statistics
  - Store usage, JVM stats, inactive durable subs, queues with pending messages
- JMX-based scripts to manage brokers
  - Remove unwanted advisories
  - Purge queues with no consumers
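
The "purge queues with no consumers" rule reduces to a filter over per-queue statistics. A minimal sketch, assuming the statistics have already been fetched over JMX (ActiveMQ queue MBeans expose `QueueSize` and `ConsumerCount`); `QueueStats` and the queue names are hypothetical, and the actual purge call is left out:

```python
from dataclasses import dataclass


@dataclass
class QueueStats:
    """Snapshot of one queue, e.g. read from the broker's JMX attributes."""
    name: str
    queue_size: int       # pending messages (QueueSize)
    consumer_count: int   # attached consumers (ConsumerCount)


def queues_to_purge(stats, min_pending=1):
    """Queues that hold messages but have nobody reading them."""
    return [q.name for q in stats
            if q.consumer_count == 0 and q.queue_size >= min_pending]


def queues_with_backlog(stats, threshold):
    """Queues whose consumers are attached but falling behind."""
    return [q.name for q in stats
            if q.consumer_count > 0 and q.queue_size > threshold]


snapshot = [
    QueueStats("Consumer.database.VirtualTopic.results", 120, 1),
    QueueStats("Consumer.old-dashboard.VirtualTopic.results", 4000, 0),
    QueueStats("Consumer.idle.VirtualTopic.results", 0, 0),
]
print(queues_to_purge(snapshot))
print(queues_with_backlog(snapshot, threshold=100))
```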

Virtual Topic monitoring
- Full testing of consumers from producers on all brokers in Network of Brokers
- Consumers instrumented to reply to test messages
  - Addressed to a single client-id on a topic
  - Send message to topic in Reply-To
- Nagios sends messages to all brokers for a topic
  - Checks they all come back
- Useful to check that all brokers in network are forwarding correctly
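
The round-trip check can be sketched as bookkeeping over correlation ids: one probe per broker goes out, and the check is green only if every probe comes back. Sending and receiving are left out; the function names and broker names are illustrative, and using a correlation id to match replies is an assumption about how the replies are paired with probes.

```python
import uuid

OK, CRITICAL = 0, 2


def make_probes(brokers):
    """Return {correlation_id: broker}, one fresh id per broker to inject at."""
    return {uuid.uuid4().hex: broker for broker in brokers}


def evaluate(probes, reply_ids):
    """Compare the ids that came back on the Reply-To topic with those sent."""
    replies = set(reply_ids)
    missing = sorted(broker for cid, broker in probes.items() if cid not in replies)
    if missing:
        return CRITICAL, "CRITICAL - no reply via " + ", ".join(missing)
    return OK, f"OK - {len(probes)} probes returned"


probes = make_probes(["broker-a", "broker-b", "broker-c"])

# Simulate the reply for the probe injected at broker-b getting lost.
received = [cid for cid, broker in probes.items() if broker != "broker-b"]
print(evaluate(probes, received))
print(evaluate(probes, list(probes)))
```

Because the probe is injected at every broker but answered by one consumer, a missing reply points at a forwarding problem on the path from that broker.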

Nagios broker status check [screenshot]

To the future: a generic messaging service
- Many concurrent applications …
- … in many languages …
- … over the WAN …
- … with little control over the users
- Not a typical messaging problem?

Design Thoughts: File Based Queue
- Isolate clients from messaging via filesystem
  - Particularly in the WAN
  - Always assume messaging could be uncontactable
- Keeps “core” broker network small
  - And keeps complexity isolated
- Allows all clients to use best language/protocol to talk to messaging
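
One way to read the file-based queue idea: producers only ever write files, and a separate forwarder drains them to the broker whenever it is reachable. The sketch below borrows the Maildir-style `tmp`/`new` directory layout with an atomic rename; that layout is my choice for illustration, not something the slides specify, and `FileQueue` is a hypothetical name.

```python
import itertools
import os
import tempfile
import time
import uuid


class FileQueue:
    """Directory-backed queue: put() never needs the broker to be reachable."""

    def __init__(self, root):
        self.tmp = os.path.join(root, "tmp")
        self.new = os.path.join(root, "new")
        os.makedirs(self.tmp, exist_ok=True)
        os.makedirs(self.new, exist_ok=True)
        self._seq = itertools.count()

    def put(self, body):
        """Write the message fully, then publish it with an atomic rename."""
        name = f"{time.time_ns():020d}-{next(self._seq):06d}-{uuid.uuid4().hex}"
        tmp_path = os.path.join(self.tmp, name)
        with open(tmp_path, "wb") as handle:
            handle.write(body)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, os.path.join(self.new, name))
        return name

    def drain(self, send):
        """Forward queued messages oldest first; stop at the first failure."""
        delivered = 0
        for name in sorted(os.listdir(self.new)):
            path = os.path.join(self.new, name)
            with open(path, "rb") as handle:
                body = handle.read()
            try:
                send(body)
            except OSError:
                break              # messaging unreachable: keep the file, retry later
            os.remove(path)
            delivered += 1
        return delivered


def unreachable(_body):
    raise ConnectionError("broker not contactable")


with tempfile.TemporaryDirectory() as root:
    queue = FileQueue(root)
    queue.put(b"result-1")
    queue.put(b"result-2")
    print(queue.drain(unreachable))    # nothing delivered, nothing lost
    forwarded = []
    print(queue.drain(forwarded.append), forwarded)
```

A message is removed only after `send` returns, so a crash between sending and removing can deliver it twice; consumers would need to tolerate duplicates.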

Design Thoughts: AMQP style remote messaging
- Queues bound to broker nodes
  - IP-like routing sends messages to destinations
- Clients connect to specific instances
- Better determinacy in network
  - Easier to manage explicit connections between brokers
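
A small sketch of what "queues bound to broker nodes" with "IP-like routing" might mean in practice: every queue has exactly one home node, and a message is forwarded hop by hop over explicitly configured links. The topology, the queue name and the use of breadth-first search to pick the path are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical explicit links between broker nodes (bidirectional).
LINKS = {
    "broker-a": ["broker-b", "broker-c"],
    "broker-b": ["broker-a"],
    "broker-c": ["broker-a", "broker-d"],
    "broker-d": ["broker-c"],
}

# Each queue lives on exactly one node.
QUEUE_HOME = {"alarms.site-x": "broker-d"}


def route(links, source, target):
    """Shortest path of nodes from source to target, or None if unreachable."""
    previous = {source: None}
    pending = deque([source])
    while pending:
        node = pending.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in links.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                pending.append(neighbour)
    return None


def path_for(queue, entry_node):
    """Nodes a message traverses from where the client connected to the queue's home."""
    return route(LINKS, entry_node, QUEUE_HOME[queue])


print(path_for("alarms.site-x", "broker-b"))
```

Because the path is a pure function of the configured links, an operator can predict where a message will travel, which is the "better determinacy" the slide asks for.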

Summary
- ActiveMQ is a key technology choice for operating and monitoring the WLCG grid infrastructure
- It provides a scalable and adaptable platform for building a wide range of messaging-based applications
- FUSE fits our model of open source software with industrial support

Thank you for your attention. Questions?
