Using Models to Improve the Availability of Automotive Software Architectures

Charles Shelton, Christopher Martin
Research and Technology Center, Robert Bosch LLC
[Charles.Shelton, Christopher.Martin]@us.bosch.com
Abstract

This paper presents an initial model for evaluating and improving the availability of a software architecture design. The model is implemented as a reasoning framework in the ArchE architecture expert system developed jointly with the Software Engineering Institute. To ensure continuous availability many automotive electronic control units (ECUs) employ an external watchdog running on a separate CPU to monitor the software running on the ECU. If the ECU has a failure that causes interruption of its functionality, the watchdog can detect this and reset the ECU to restore correct operation. The availability model can automatically evaluate the effectiveness of a watchdog design in the software architecture and can propose improvements to achieve better availability before implementation decisions are made. The model enables a quantitative analysis of system availability that can better guide software architecture and dependability design decisions and potentially reduce implementation and testing effort.

1. Introduction

Many automotive electronic control units (ECUs) implement most of their functionality in real-time software systems. Thus, ensuring the availability of the software system is essential to guaranteeing the dependable operation of the ECU. This paper presents a model for evaluating the availability of a software architecture. The results of this model can then be used to judge the comparative effectiveness of different design mechanisms in the software architecture for improving system availability.

For our initial study, we focus on the watchdog concept. The watchdog is a major design mechanism being used in ECUs to compensate for transient system failures and maintain availability. A watchdog is usually an external circuit or processor that monitors the main CPU on the ECU. The application software running on the CPU periodically sends a signal to the watchdog indicating that it is still functioning. If a failure occurs in the ECU software, it would not be able to send this signal, and thus the watchdog would determine that a system failure has occurred. Once the watchdog detects a failure, it triggers a system reset to recover the system and resume normal operation.

The watchdog concept has been widely used in the industry for at least 20 years [5]. However, there has been little effort to evaluate the effectiveness of the watchdog, and whether it provides the improved availability it promises. Many automotive software developers and architects regard the watchdog as an "insurance policy" to protect against any unforeseen hardware and software faults in the ECU. Therefore, it is important to understand how effective a watchdog is in improving the availability of the ECU software.

We developed the availability model as a reasoning framework (RF) for the ArchE architecture expert system that has been developed with Bosch support at the Software Engineering Institute (SEI) at Carnegie Mellon University (CMU); a full description of the ArchE methodology is available in [1]. ArchE (Architecture Expert) is a rule-based expert system that uses models to evaluate a software architecture design and how well it satisfies its non-functional quality requirements (e.g. real-time performance and modifiability), and automatically proposes suggestions to improve the architecture design when it does not satisfy those requirements. The Bosch Rapid Architecture Prototyping tool (RAPT) focuses on incorporating the ArchE expert system into a design tool for software architects that enables model-based software architecture design.

The remainder of this paper is organized as follows. Section 2 gives an overview of ArchE with important system definitions. Section 3 describes the availability reasoning framework we developed and how we model and calculate availability for a software architecture
Fourth International Workshop on Software Engineering for Automotive Systems (SEAS'07)
0-7695-2968-2/07 $20.00 © 2007
[Figure: high-level module diagram. The architect (ArchE user) provides requirements and architecture input and receives the results of RF evaluations and proposed design tactics. The Seeker expert system module mediates between the user and the Performance, Availability, and Modifiability RFs: it forwards commands to apply tactics to an RF, and collects model evaluation results and design tactic suggestions from the RFs.]

Figure 1. High-Level Modules in ArchE
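The arbitration flow in Figure 1 can be sketched as a small loop in which the Seeker collects tactic suggestions from every RF and surfaces the most promising one. This is a minimal illustrative sketch in Python; the class and function names are our own assumptions, not ArchE's actual rule-based implementation or API:

```python
from dataclasses import dataclass

@dataclass
class Tactic:
    description: str
    net_improvement: float  # estimated net benefit across all scenarios

class ReasoningFramework:
    """Illustrative stand-in for one RF (performance, availability,
    or modifiability); not ArchE's real interface."""
    def suggest_tactics(self, architecture):
        raise NotImplementedError

def seeker_pick(rfs, architecture):
    """The Seeker collects tactic suggestions from every RF and presents
    the one promising the most net improvement (None if no suggestions)."""
    suggestions = [t for rf in rfs for t in rf.suggest_tactics(architecture)]
    return max(suggestions, key=lambda t: t.net_improvement, default=None)
```

In the real system the Seeker also estimates cross-quality side effects before ranking; the sketch reduces that to a single precomputed `net_improvement` score.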
design. Section 4 details some open issues with the accuracy of the model and our plan for validating the model. Section 5 concludes the paper.

2. Overview of ArchE

ArchE is composed of multiple reasoning frameworks (RFs) that evaluate the architecture with respect to a particular quality. Each RF can process requirements for its quality, generate an initial architecture design based on the requirements, evaluate how well the architecture satisfies the quality requirements using a model derived from the architecture design, and propose design suggestions (tactics) to improve the architecture and bring it closer to satisfying its quality requirements.

Since each RF only evaluates a single quality, they may propose tactics that conflict with each other to satisfy their individual quality requirements. For example, a modifiability tactic that decomposes and encapsulates software modules behind interfaces may introduce additional runtime execution penalties that adversely affect real-time performance. Therefore, ArchE has an arbitration module, the Seeker, that collects the results and suggestions from each RF, determines what potential side effects there are for applying each tactic, and evaluates the tactics to decide which tactics promise the most net improvement of the architecture design. Figure 1 shows the decomposition of the ArchE system and its major components.

2.1 Basic ArchE Concepts

The first step in providing information to ArchE to evaluate an architecture design is to provide the software requirements that must be fulfilled. Both functional and non-functional requirements must be specified.

Functional requirements are specified as responsibilities, and non-functional quality requirements are specified as quality attribute scenarios. Responsibilities represent the units of functionality that the software must provide. They are the "atomic" units that are assigned to architecture design elements (modules and tasks) in the architecture to be implemented. Each responsibility will have a set of parameters (e.g. execution time in milliseconds, cost of change in person-days) for which the architect must provide some initial estimates. These are the input parameters that will be used for executing the RF models and evaluating the architecture design for each quality.

Quality attribute scenarios (described in detail in [2]) define the software quality requirements in a concise format. Each scenario includes a response measure that specifies a quantitative constraint that can be evaluated based on the results of an RF model. A scenario is defined by six parts: stimulus, source of stimulus, environment, artifact, response, and response measure. The response measure is the critical part of the scenario because it provides a quantitative constraint with which to evaluate whether the scenario is satisfied. Table 1 illustrates an example of a real-time performance quality attribute scenario. Each RF takes the set of scenarios for its quality and uses their response measures as constraints to evaluate whether the model derived from the architecture satisfies its requirements.
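The six-part scenario structure described in Section 2.1 can be pictured as a simple record. The class and field names below are illustrative assumptions, not ArchE's actual internal representation; the instance encodes the Table 1 performance scenario:

```python
from dataclasses import dataclass

@dataclass
class QualityAttributeScenario:
    """One six-part quality attribute scenario (names are illustrative)."""
    quality: str                 # which RF evaluates this scenario
    source_of_stimulus: str
    stimulus: str
    environment: str
    artifact: str
    response: str
    response_measure_ms: float   # the quantitative constraint the RF checks

# The Table 1 performance scenario expressed in this form:
controller_scenario = QualityAttributeScenario(
    quality="performance",
    source_of_stimulus="Input data received from sensor",
    stimulus="Periodic activation every 10 ms",
    environment="Normal run-time operation",
    artifact="Control algorithm, sensor, actuator",
    response="Compute and update controller output value to actuator",
    response_measure_ms=10.0,
)
```

Only the response measure is numeric here because it is the part an RF model evaluates directly; the other five parts are descriptive context.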
Table 1. Example Quality Attribute Scenario

Performance Scenario: A controller receives periodic input from a sensor every 10 ms. The controller must run its control algorithm and send an output to a system actuator within 10 ms after receiving the sensor input.
• Source of Stimulus: Input data received from sensor
• Stimulus: Periodic activation every 10 ms
• Environment: Normal run-time operation
• Artifact: Control algorithm, sensor, actuator
• Response: Compute and update controller output value to actuator
• Response Measure: Complete controller operation within 10 ms

2.2 Elements of a Reasoning Framework

Each RF consists of several modules, each containing specific sets of rules in the ArchE system. These modules provide the following functions:
• Scenario and Responsibility Parameter Definition
• Initial Design Creation
• Model Interpretation and Evaluation
• Suggest Design Tactics
• Apply Design Tactics
We describe these modules using the real-time performance RF as an example.

In the Scenario and Responsibility Parameter Definition module, the scenario type for the RF's quality is defined. Each scenario specified for this quality must conform to the format as defined by the RF. This enables the rules in ArchE to automatically process the scenarios in each RF. When considering real-time performance, each scenario's response measure determines a real-time deadline for a particular system function, and the scenario's stimulus indicates whether the function is periodic or sporadic.

In addition to the scenario type definition, each RF defines a set of parameters that must be provided as input for each functional responsibility. These parameters are required as inputs for the execution of the analysis model. The performance RF requires that each responsibility specify an execution time. With these execution times, the performance RF can assign execution times to each task in the architecture based on which functional responsibilities are assigned to which tasks.

The Initial Design Creation module contains rules for generating an initial architecture design only from the requirements (responsibilities and scenarios) provided. This is required when a project starts only with a requirements specification and there is no existing architecture design to evaluate. The RF will use some general heuristics to develop an initial architecture. For the performance RF, the rules assume that each scenario and its response measure define a separate task in the architecture, and create the set of tasks. The RF assigns responsibilities to each task according to which responsibilities are linked to each scenario. Each task's period (or minimum arrival time for a sporadic task) is derived from its scenario's stimulus, and its deadline is derived from its scenario's response measure. Once these tasks are generated, the performance RF can evaluate the architecture design using a real-time performance analysis.

Alternatively, if the architect has a specific design he or she wants to evaluate, this design can be used instead of the output of the Initial Design Creation module. Once an architecture design is present, the RF can interpret and evaluate its model to determine whether the architecture satisfies its scenarios for that quality.

The Model Interpretation and Evaluation module contains the rules to interpret a model internal to the RF from the architecture design. This model is then evaluated and the results are used to judge whether the architecture satisfies its scenarios. In the performance RF, rate-monotonic analysis (RMA) [4] is applied to the set of tasks in the architecture to calculate the latency for each task. The latency for each task is compared to its deadline to determine if the architecture satisfies the response measure of the scenario for that task.

If the architecture does not satisfy some of the scenarios for a given quality, the RF will execute the Suggest Design Tactics module. In this module there are rules to select possible design changes to the architecture and evaluate whether they improve the model in the RF. The RF will select the most promising tactics that show the greatest improvement in terms of satisfying the scenarios for that quality, and send them to the Seeker for arbitration with possible tactics from other RFs. The Seeker module is independent from all of the ArchE RFs and will decide which tactics taken from all of the RFs to present to the user. The Seeker makes this decision by prioritizing the tactic suggestions received from the RFs according to their net improvement of the architecture design towards satisfying all requirements scenarios. The user will then select a tactic to apply to change the architecture design.

In the performance RF, design tactic suggestions include reducing the execution time for a responsibility, increasing the period for a task, and lengthening the deadline for a task. The performance
RF will try these tactics out for each task that does not satisfy its scenario, rerun the RMA model, and select the tactics that produce the greatest overall latency improvement for the architecture tasks.

Finally, the Apply Design Tactics module contains the rules that receive the user's input for selecting a tactic, and will alter the architecture design according to the user's response. For example, if the user selects a reduce execution time tactic for a responsibility, the performance RF has the rules that make ArchE update the responsibility parameter value, and update the task execution time for the task containing that responsibility in the architecture design.

2.3 The RAPT Tool

At Bosch, we have incorporated the ArchE expert system into our RAPT architecture design tool. RAPT is a tool implemented in the Java-based Eclipse framework [3]. RAPT is intended to provide a more streamlined user interface for automotive software architects that can encapsulate some of the more formal details of the ArchE expert system and its quality attribute RFs. Users can input their requirements and architecture designs into RAPT, and then click a button to have their architecture evaluated by ArchE.

In addition to providing a user interface for ArchE, the RAPT tool can store and retrieve requirements and architecture designs as models in an XML format, stored in an online database. The RAPT tool can search the database for requirements or design elements used in previous projects. These items can then be immediately incorporated into new architecture designs and requirements specifications, encouraging reuse of software architecture assets.

3. Availability Reasoning Framework

We developed the availability RF specifically to evaluate the effectiveness of watchdog configurations in automotive applications. Therefore, we do not currently address other possible dependability mechanisms that could contribute to system availability. The goal of this RF is to provide a quantitative evaluation framework for determining the required design parameters for a watchdog configuration in order to satisfy availability constraints and requirements.

The model for evaluating availability draws directly from the real-time performance task architecture. The tasks defined in the architecture specify run-time characteristics that can be used to evaluate availability on a task-by-task basis. Therefore, the availability RF depends on having an existing performance architecture analyzed by the performance RF.

We developed the availability RF using the standard ArchE RF structure. The next few sections describe the components of the availability RF.

3.1 Availability Scenarios and Parameters

We identified four types of availability scenarios to be evaluated by the RF:
• General System Availability – This scenario type describes the overall availability target for the system, such as "five nines" or 0.99999 availability. There should only be one of these scenarios for a single architecture design.
• Maximum Recovery Time – This scenario type puts a constraint on the maximum time it takes the system to recover to a known good state once a failure is detected.
• Minimum Time Before Recovery – This scenario type puts a constraint on the minimum time the system should wait before initiating a recovery action after a failure is detected. This scenario helps specify a "grace period" after a failure is detected, in case the detection is a false positive and does not require the watchdog to perform a system reset. This time allows the system a "second chance" to send its service signal to the watchdog.
• Maximum Time for Failure Detection – This scenario type puts a constraint on the maximum time it should take for a failure to be detected by the system. In terms of the watchdog, this puts an upper bound on the timeout period for the watchdog.

The parameters required for the functional responsibilities in the availability RF include:
• Failure Rate – The architect must give a rough estimate of the failure rate, expressed as the number of failures per hour, for the implementation of each functional responsibility in the system. This can be based on data from previous software systems, failure rates of underlying hardware resources being used by each particular function, or developer experience.
• Jitter – The architect must specify the amount of jitter in the execution time of the responsibility, expressed as a percentage of the responsibility's total execution time.
• Jitter Rate – The architect must specify the rate at which each responsibility experiences jitter in its execution time.
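The per-responsibility inputs listed above, together with the initial watchdog configuration rule described in Section 3.2 (timeout set 20% longer than the longest task period, grace period and recovery time supplied by the user), might be captured as follows. This is a hypothetical sketch; the type and field names are our own assumptions, not the RF's actual data model:

```python
from dataclasses import dataclass

@dataclass
class Responsibility:
    name: str
    failure_rate_per_hour: float  # rough architect estimate
    jitter_fraction: float        # jitter as a fraction of execution time
    jitter_rate_per_hour: float   # how often the jitter occurs

@dataclass
class WatchdogConfig:
    timeout_s: float   # t_wd: upper bound on failure detection time
    grace_s: float     # t_grace: wait before triggering a reset
    recovery_s: float  # t_recovery: time to restart to a known good state

def initial_watchdog_config(task_periods_s, grace_s, recovery_s):
    """Initial rule from Section 3.2: timeout = 20% longer than the
    longest task period; grace and recovery times come from the user."""
    return WatchdogConfig(timeout_s=1.2 * max(task_periods_s),
                          grace_s=grace_s, recovery_s=recovery_s)

# Tasks with periods of 10 ms, 100 ms, and 500 ms yield a 600 ms timeout.
cfg = initial_watchdog_config([0.01, 0.1, 0.5], grace_s=0.05, recovery_s=2.0)
```

Making the timeout slightly longer than the slowest task's period lets even the lowest priority task service the watchdog in time under normal operation.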
[Figure: timeline of system and watchdog interaction. The system services the WD at times T0 through T3; at T4 a failure occurs and the WD is not serviced. After the WD timeout period t_wd elapses and the WD "grace period" t_grace expires, the WD triggers a system reset; after the system recovery time t_recovery, the system restarts in a working state at T'0 and resumes servicing the WD at T'1. Worst-case downtime per failure = t_wd + t_grace + t_recovery.]

Figure 2. Timeline Showing System and Watchdog (WD) Interaction When a Failure Occurs
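The worst-case downtime per failure annotated in Figure 2 feeds directly into the unavailability formulas of Section 3.3. The following is a minimal numeric sketch of those calculations; the function names, the boolean reduction of the jitter grace-period check, and the unit choices (failure rates per hour, watchdog times in seconds converted to hours) are our own assumptions, not the RF's implementation:

```python
def task_failure_rate(responsibility_rates_per_hour):
    """FR_T = 1 - product(1 - FR_Ri): the task fails if any of its
    responsibilities fails, assuming independent failures."""
    p = 1.0
    for fr in responsibility_rates_per_hour:
        p *= 1.0 - fr
    return 1.0 - p

def worst_case_downtime_h(t_wd_s, t_grace_s, t_recovery_s):
    """Per-failure downtime from Figure 2, converted to hours."""
    return (t_wd_s + t_grace_s + t_recovery_s) / 3600.0

def task_unavailability(fr_t, utilization, downtime_h,
                        jitter_rate_per_hour=0.0, jitter_utilization=0.0,
                        jitter_defeats_grace=False):
    """TU_F + TU_J for one task. The jitter term is assessed only when
    reanalysis shows the watchdog service task overrunning its deadline
    by more than the grace period (reduced here to a boolean flag)."""
    tu_f = fr_t * utilization * downtime_h
    tu_j = (jitter_rate_per_hour * jitter_utilization * downtime_h
            if jitter_defeats_grace else 0.0)
    return tu_f + tu_j

def overall_availability(per_task_unavailabilities):
    """Overall Availability = 1 - sum over all tasks of (TU_Fi + TU_Ji)."""
    return 1.0 - sum(per_task_unavailabilities)

# Example: one task with two responsibilities, 20% utilization, and a
# watchdog with t_wd = 0.5 s, t_grace = 1 s, t_recovery = 10 s.
down_h = worst_case_downtime_h(0.5, 1.0, 10.0)
fr_t = task_failure_rate([1e-5, 2e-5])   # roughly 3e-5 failures/hour
tu = task_unavailability(fr_t, 0.2, down_h)
availability = overall_availability([tu])
```

Because the failure rate is per hour and the downtime is expressed in hours, each product is a dimensionless downtime ratio, matching the paper's definition of unavailability.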
The availability RF uses the failure rate parameters for each responsibility as the initial input to the availability model to determine how often the system might experience a failure. The jitter and jitter rate parameters are used to model how often jitter may cause the system to miss a watchdog service deadline, but not cause a system failure. This case would represent a false positive error detection.

3.2 Initial Availability Architecture Design

Since the availability RF uses the run-time performance architecture as a basis, the rules for an initial availability design are straightforward. The RF generates an initial configuration for a watchdog to monitor the software system, and creates a low priority watchdog servicing task to be added to the system task architecture. The servicing task has a period that equals the watchdog configuration's timeout value. For an initial configuration, this timeout value is set to 20% longer than the period of the longest-period task in the software architecture.

The RF will also ask the user to provide initial values for the grace period and recovery time parameters of the watchdog.

3.3 Model Interpretation and Evaluation

Availability is defined as the ratio of expected uptime (when the system can provide service) to the total operating time of the system. This can also be expressed as 1 minus the unavailability, which is the ratio of downtime (when the system is expected to provide service but is not) to total operating time.

Using the input parameters provided by the architect and the performance task architecture, we calculate the unavailability due to the individual failures of each task, and compute system availability as 1 minus the sum of all task unavailability values. For the purposes of the model, we assume that task failures are independent and that each will cause a system failure, which means the system's watchdog servicing task does not run, so that the watchdog detects a failure and restarts the system. If the input parameters are accurate, these assumptions will provide a conservative upper bound on the unavailability of each task, and a lower bound on the system availability.

To calculate the system unavailability due to each task, we must evaluate two cases: genuine task failures, and task overruns due to jitter, which cause false positive failures. For the case of real task failures, we must first calculate the task failure rate based on the failure rates of its responsibilities, and then calculate the unavailability penalty due to this task failure rate. Assuming that the failure rates of individual responsibilities in a task are independent, the overall task failure rate FR_T is given by:

FR_T = 1 − Product over all i of (1 − FR_Ri)

where FR_Ri is the failure rate for each of the responsibilities allocated to the task. Since the task can only fail when it is executing, the task failure rate must be multiplied by the task utilization U_T, which is given by the ratio of the task's execution time to its period. To calculate the unavailability due to this task, we multiply that result by the time penalty due to the detection time D_WD, grace period G_WD, and recovery time RT_WD of the watchdog configuration:

Task Unavailability due to failures TU_F = FR_T * U_T * (D_WD + G_WD + RT_WD)

See Figure 2 for a timeline illustrating what happens when a task failure occurs.

To calculate the unavailability due to task jitter, a similar method can be applied. The worst case task
jitter rate JR_T can be calculated from the individual responsibility jitter rates in a manner similar to how we calculate the task failure rate from individual responsibility failure rates. We use a conservative worst case jitter execution time penalty for the task, given as the sum of all the jitter values for the individual responsibilities in that task. Then the unavailability due to task jitter is simply:

Task Unavailability due to jitter TU_J = JR_T * U'_T * (D_WD + G_WD + RT_WD)

However, we cannot automatically assume that task jitter will cause the system to miss its watchdog service. In order to determine if the task jitter causes the watchdog servicing task to miss its deadline, we must reanalyze the performance task architecture using the worst case jitter value added to the task. When the task latencies are recalculated, if the watchdog service task has overrun its deadline by more than the watchdog configuration's grace period, then we can reason that the task jitter will cause a false positive failure detection and watchdog reset. Otherwise, the task jitter will be tolerated by the watchdog and no unavailability penalty will be assessed.

Finally, the system availability can be calculated from the results of all the task unavailability values:

Overall Availability = 1 − (sum of all TU_Fi + TU_Ji)

for all i tasks in the architecture design.

3.4 Availability Design Tactics

If the calculated system availability does not meet the availability scenarios' response measures, the availability RF will suggest design tactics to improve the architecture and satisfy its requirements. With the results of the availability model, the availability RF has heuristic rules that decide which tactics to propose.

A watchdog configuration has three major parameters associated with it that can be manipulated in the design to affect system availability:
• Watchdog timeout value
• Watchdog grace period value
• System recovery time after a watchdog reset

The watchdog timeout value is an upper bound on the time it will take for the watchdog to detect a system failure. Shortening the timeout period for the watchdog could increase availability because failures would be detected, and thus recovered, more quickly. However, in order to catch all possible task failures, the watchdog service task should be the lowest priority task, meaning the deadline for the service task in the software architecture and the watchdog timeout period should be longer than the deadline of the lowest priority software application task. Also, a shorter watchdog timeout period might be more sensitive to false positives from task jitter.

The watchdog grace period has a similar tension between false positives and real failures. A long grace period for the watchdog will eliminate more false positives, but cause a greater unavailability penalty for each real task failure. The availability RF must compare the measures of unavailability due to task failures and task jitter to decide whether to propose an increase or decrease to the grace period value.

The system recovery time after a watchdog reset accounts for the majority of the time spent recovering the system after a failure. The availability RF would always like to propose reducing this parameter in the watchdog configuration to improve availability. However, this recovery time is largely dependent on the ECU hardware configuration, and may not be tunable by the software architects or developers. Ultimately the software architect must decide if applying this tactic is possible and appropriate for their system.

The availability RF can test these design tactics by rerunning the availability model to compare the new model's availability metric with that of the original model. The RF will then propose the tactics most likely to provide the biggest gain in availability for the software architecture. The architect can then decide whether to select one of the tactics proposed by ArchE. The architecture design will be updated, and the process will iterate until all scenarios are satisfied.

Another possible availability tactic might be increasing the period or reducing the execution time of a task that has a relatively high failure rate. This would effectively reduce the portion of time the task runs in the system, reducing the chance that it can cause a system failure. However, this tactic could conflict with the real-time performance RF. A change in the task architecture may cause some performance scenarios to become unfulfilled. This is one example of a possible tradeoff decision, and in this situation the architect must decide which scenarios (performance or availability) are more important.

4. Open Issues

One of the major drawbacks to this model-based approach for availability evaluation is that the architect must provide some nominal failure rate values for the functions in the system. Determining failure rates for software functions and modules is a challenging problem that remains unsolved. The issue is even more problematic considering that this architecture availability analysis may be done before the implementation is built. Our current approach is to
base failure rate estimates on underlying hardware resources accessed by the software functions. Also, it may not be as important to have precise failure numbers to generate a completely accurate availability estimate, as long as we can evaluate a range of configurations and their response to varying failure rates. The responsibility parameters in ArchE can be quickly modified to observe the response of the availability model. This should be useful for finding "good enough" watchdog configurations that satisfy availability quality requirements and optimize the architecture to satisfy all of its quality requirements.

5. Conclusions and Future Work

In this paper we presented a framework for evaluating and improving availability in a software architecture design. This framework uses a model to determine the availability, and heuristics to propose improvements to the architecture if it does not satisfy its availability requirements. Although this approach depends on some initial estimates for software failure rates, the model can be used to identify a range of acceptable solutions for a given set of requirements and a range of possible failure rates.

Our availability RF enables an architect to explore design configurations for a watchdog and how the watchdog can improve system availability. This is important for automotive ECU software designers because the watchdog concept is used as an important dependability mechanism to tolerate unforeseen transient software and hardware failures. Despite the importance of the watchdog and its near-ubiquitous use in ECUs, there has been little work done to model and evaluate how the watchdog contributes to system availability and dependability. This work addresses that concern.

The current RF does not address other dependability concerns such as reliability, safety, and security. Since the focus of this work was to develop a specific quantitative model that could be used to help make sound architecture design decisions regarding the watchdog, we decided to narrow the scope to the dependability quality most directly affected by the watchdog mechanism: availability. Other dependability concerns could be addressed with additional RFs with separate analysis models.

ArchE enables software architects to analyze and improve their software architecture designs using RFs to evaluate the architecture against its quality requirements using quantitative models. Each RF incorporates well known design strategies in the form of rules for suggesting design tactics to improve the architecture to fulfill specific requirements. Additionally, the RAPT tool provides a more intuitive user interface and also provides support for reuse of requirements and design elements via an online repository.

Within the ArchE expert system, multiple non-functional quality attributes, such as performance, modifiability, and availability, can be evaluated to assess the ability of a software architecture to satisfy its requirements. Each RF will provide design suggestions to improve the architecture, and ArchE can evaluate the side effects from each quality attribute RF's tactics on the other quality attributes of the architecture design. Thus, tradeoff decisions can be explicitly evaluated. As shown earlier, the performance and availability RFs can have conflicts since their models are based on the same runtime task architecture.

Future work for the availability RF will include expanding its scope to propose design tactics for other dependability mechanisms besides the watchdog. This will allow the availability model to provide more general architecture design assistance.

6. Acknowledgements

This work builds on the ArchE concepts developed in conjunction with the Software Engineering Institute at Carnegie Mellon University.

7. References

[1] F. Bachmann, L. Bass, M. Klein, and C. Shelton, "Designing Software Architectures to Achieve Quality Attribute Requirements", IEE Proceedings – Software, Vol. 152, No. 5, pp. 153-165, August 2005.

[2] L. Bass, P. Clements, and R. Kazman, Software Architecture in Practice, Second Edition, Addison-Wesley, Boston, MA, USA, 2003.

[3] The Eclipse Foundation, "Eclipse – An Open Development Platform", <http://www.eclipse.org>, January 2007.

[4] M. Klein, T. Ralya, B. Pollak, et al., A Practitioner's Handbook for Real-Time Analysis: Guide to Rate Monotonic Analysis for Real-Time Systems, Kluwer Academic Publishers, Boston, MA, USA, 1993.

[5] A. Mahmood and E. J. McCluskey, "Concurrent Error Detection Using Watchdog Processors – A Survey," IEEE Transactions on Computers, Vol. 37, No. 2, pp. 160-174, February 1988.