
On Stochastic Dynamic Programming and its Application to Maintenance

FRANÇOIS BESNARD

Master’s Degree Project Stockholm, Sweden 2007

On Stochastic Dynamic Programming and its Application to Maintenance

MASTER THESIS BY FRANÇOIS BESNARD

Master Thesis written at the Royal Institute of Technology, KTH School of Electrical Engineering, June 2007
Supervisors: Assistant Professor Lina Bertling (KTH), Professor Michael Patriksson (Chalmers Applied Mathematics), Dr. Erik Dotzauer (Fortum)
Examiner: Assistant Professor Lina Bertling

XR-EE-ETK 2007:008

Abstract
Market and competition rules have been introduced among power system companies as a result of the restructuring and deregulation of the power system. The generating companies, as well as the transmission and distribution system operators, aim to minimize their costs. Maintenance can be a significant part of the total costs, and the pressure to reduce the maintenance budget leads to a need for efficient maintenance. This work focuses on an optimization methodology that could be useful for optimizing maintenance. The method, stochastic dynamic programming, is interesting because it can explicitly integrate the stochastic behavior of functional failures. Different models based on stochastic dynamic programming are reviewed together with the possible optimization methods to solve them. The interest of the models in the context of maintenance optimization is discussed. An example of a multi-component replacement application is proposed to illustrate the theory.
Keywords: Maintenance Optimization, Dynamic Programming, Markov Decision Process, Power Production


Acknowledgements
First of all, I would like to thank my supervisors who each in their way supported me in this work. Ass. Prof. Lina Bertling for her encouragements, constructive remarks and for giving me the opportunity of working on this project, Dr. Erik Dotzauer for many valuable inputs, discussions and comments and Prof. Michael Patriksson for his help on mathematical writing. Special greetings to all my friends and companions of study all over the world. Finally, my heart turns to my parents and my love for their endless encouragements and support in my studies and life. Stockholm June 2007


Abbreviations
ADP: Approximate Dynamic Programming
CBM: Condition Based Maintenance
CM: Corrective Maintenance
DP: Dynamic Programming
IHSDP: Infinite Horizon Stochastic Dynamic Programming
LP: Linear Programming
MDP: Markov Decision Process
PI: Policy Iteration
PM: Preventive Maintenance
RCAM: Reliability Centered Asset Maintenance
RCM: Reliability Centered Maintenance
SDP: Stochastic Dynamic Programming
SMDP: Semi-Markov Decision Process
TBM: Time Based Maintenance
VI: Value Iteration


Notations
Numbers
M: Number of iterations for the evaluation step of modified policy iteration
N: Number of stages

Constants
α: Discount factor

Variables
i: State at the current stage
j: State at the next stage
k: Stage
m: Number of iterations left for the evaluation step of modified policy iteration
q: Iteration number for the policy iteration algorithm
u: Decision variable

State and Control Spaces
µ_k: Function mapping the states to a decision
µ*_k(i): Optimal decision at stage k for state i
µ: Decision policy for stationary systems
µ*: Optimal decision policy for stationary systems
π: Policy
π*: Optimal policy
U_k: Decision action at stage k
U*_k(i): Optimal decision action at stage k for state i
X_k: State at stage k

Dynamic and Cost Functions
C_k(i, u): Cost function
C_k(i, u, j): Cost function
C_ij(u) = C(i, u, j): Cost function if the system is stationary
C_N(i): Terminal cost for state i
f_k(i, u): Dynamic function
f_k(i, u, ω): Stochastic dynamic function
J*_k(i): Optimal cost-to-go from stage k to N starting from state i
ω_k(i, u): Probabilistic function of a disturbance
P_k(j, u, i): Transition probability function
P(j, u, i): Transition probability function for stationary systems
V(X_k): Cost-to-go resulting from a trajectory starting from state X_k

Sets
Ω^X_k: State space at stage k
Ω^U_k(i): Decision space at stage k for state i

Contents

1 Introduction
  1.1 Background
  1.2 Objective
  1.3 Approach
  1.4 Outline

2 Maintenance
  2.1 Types of Maintenance
  2.2 Maintenance Optimization Models

3 Introduction to the Power System
  3.1 Power System Presentation
  3.2 Costs
  3.3 Main Constraints

4 Introduction to Dynamic Programming
  4.1 Introduction
  4.2 Deterministic Dynamic Programming

5 Finite Horizon Models
  5.1 Problem Formulation
  5.2 Optimality Equation
  5.3 Value Iteration Method
  5.4 The Curse of Dimensionality
  5.5 Ideas for a Maintenance Optimization Model

6 Infinite Horizon Models - Markov Decision Processes
  6.1 Problem Formulation
  6.2 Optimality Equations
  6.3 Value Iteration
  6.4 The Policy Iteration Algorithm
  6.5 Modified Policy Iteration
  6.6 Average Cost-to-go Problems
  6.7 Linear Programming
  6.8 Efficiency of the Algorithms
  6.9 Semi-Markov Decision Process

7 Approximate Methods for Markov Decision Process - Reinforcement Learning
  7.1 Introduction
  7.2 Direct Learning
  7.3 Indirect Learning
  7.4 Supervised Learning

8 Review of Models for Maintenance Optimization
  8.1 Finite Horizon Dynamic Programming
  8.2 Infinite Horizon Stochastic Models
  8.3 Reinforcement Learning
  8.4 Conclusions

9 A Proposed Finite Horizon Replacement Model
  9.1 One-Component Model
  9.2 Multi-Component Model
  9.3 Possible Extensions

10 Conclusions and Future Work

A Solution of the Shortest Path Example

Reference List

Chapter 1

Introduction
1.1 Background

Market and competition rules have been introduced among power system companies as a result of the restructuring and deregulation of modern power systems. The generating companies, as well as the transmission and distribution system operators, aim to minimize their costs. Maintenance costs can be a significant part of the total costs, and the pressure to reduce the maintenance budget leads to a need for efficient maintenance.

Maintenance can be divided into Corrective Maintenance (CM) and Preventive Maintenance (PM) (see Chapter 2.1). CM means that an asset is maintained once an unscheduled functional failure occurs. CM can imply high costs for unsupplied energy, interruption, possible deterioration of the system, human risks or environmental consequences, etc. PM is employed to reduce the risk of unexpected failure. Time Based Maintenance (TBM) is used for the most critical components and Condition Based Maintenance (CBM) for the components that are worth monitoring and not too expensive to monitor. These maintenance actions have a cost for unsupplied energy, inspection, repair, replacement, etc. Efficient maintenance should balance corrective and preventive maintenance to minimize the total costs of maintenance.

The probability of a functional failure for a component is stochastic. The probability depends on the state of the component resulting from its history (age, intensity of use, external stress (such as weather), maintenance actions, human errors and construction errors). Stochastic Dynamic Programming (SDP) models are optimization models that explicitly integrate such stochastic behaviors. This feature makes the models interesting and was the starting idea of this work.

1.2

Objective

The main objective of this work is to investigate the use of stochastic dynamic programming models for maintenance optimization and identify possible future applications in power systems.

1.3

Approach

The first task was to understand the different dynamic programming approaches. A first distinction was made between finite horizon and infinite horizon approaches. The different techniques that can be used for solving a model based on dynamic programming were investigated. For infinite horizon models, approximate dynamic programming was studied. These types of methods are related to the field of reinforcement learning. Some SDP models found in the literature were reviewed. Conclusions were drawn about the applicability of each approach for maintenance optimization problems. Moreover, future avenues for research were identified. A finite horizon replacement model was developed to illustrate the possible use of SDP for power system maintenance.

1.4

Outline

Chapter 2 gives an overview of the maintenance field. The most important methods and some optimization models are reviewed. Chapter 3 briefly discusses power systems. Some costs and constraints for optimization models are proposed. Chapters 4-7 focus on different Dynamic Programming (DP) approaches and algorithms to solve them. The assumptions of the models and their practical limitations are discussed. The basics of DP formulation are introduced with deterministic models in Chapter 4. Chapters 5 and 6 focus on Stochastic Dynamic Programming methods, respectively for finite and infinite horizons. Chapter 7 is an introduction to Approximate Dynamic Programming (ADP), also known as Reinforcement Learning (RL), which is an approach to solving infinite horizon Dynamic Programming problems using approximate methods. Chapter 8 gives a review of some maintenance optimization models based on dynamic programming. Conclusions are made about the possible use of the different approaches in maintenance optimization. Chapter 9 is an example of how finite horizon dynamic programming can be used for maintenance optimization. Chapter 10 summarizes the conclusions of the work and discusses possible avenues for research.


Chapter 2

Maintenance
The context of maintenance optimization is briefly described in this chapter. Different types of maintenance are defined in Section 2.1. Some maintenance optimization models are reviewed in Section 2.2.

2.1

Types of Maintenance

Maintenance is a combination of all technical, administrative and managerial actions during the life cycle of an item intended to retain it, or restore it to a state in which it can perform the required functions [1]. Figure 2.1 shows a general picture of the different types of maintenance.

Corrective Maintenance (CM) is carried out after fault recognition and intended to put an item into a state in which it can perform a required function [1]. It is typically performed when there is no way, or it is not worthwhile, to detect or prevent a failure.

Preventive maintenance aims at undertaking maintenance actions on a component before it fails, e.g. to avoid the high costs of replacement, unsupplied energy and possible damage to the surroundings of the component. One can distinguish between two kinds of preventive maintenance:

1. Time Based Maintenance (TBM) is preventive maintenance carried out in accordance with established intervals of time or number of units of use, but without previous condition investigation [1]. TBM is used for failures that are age-related and for which the probability of failure over time can be established.

Figure 2.1: Maintenance tree based on [1]. Maintenance is divided into Preventive Maintenance and Corrective Maintenance; Preventive Maintenance is divided into Time-Based Maintenance (TBM) and Condition Based Maintenance (CBM), the latter being continuous, scheduled or inspection based.

2. Condition Based Maintenance (CBM) is preventive maintenance based on performance and/or parameter monitoring and the subsequent actions [1]. CBM corresponds to all the maintenance methods using diagnostics or inspections to decide on the maintenance actions. Diagnostic methods include the use of human senses (noise, visual, etc.), measurements or tests. They can be undertaken continuously or during scheduled or requested inspections. CBM is often used for non-age related failures.

2.2

Maintenance Optimization Models

Unexpected failures of a component in a system can lead to expensive Corrective Maintenance. Preventive Maintenance approaches can be used to avoid CM. If preventive maintenance is done too frequently, however, it can also result in a very high cost. The aim of maintenance optimization could be to balance corrective and preventive maintenance to minimize, for example, the total cost of maintenance. Numerous maintenance optimization models have been proposed in the literature and interesting reviews have been published. Wang [43] gives an interesting picture of maintenance policy optimization and its influence factors. Cho et al. [15], Dekker et al. [16] and Nicolai et al. [31] focus mainly on multi-component problems. In this section, the most common classes of models are described and some references are given. This short review is based on Chapter 8 of [4].

2.2.1

Age Replacement Policies

Under an age replacement policy, a component is replaced at failure or at the end of a specified interval, whichever occurs first [17]. This policy makes sense if a preventive replacement is less expensive than a corrective replacement and the failure rate increases with time. Barlow et al. [7] describe a basic age replacement model. A model including discounting has been proposed in [17]. In this model, the loss value of a replaced component decreases with its age. A model with minimal repair is discussed in [6]. If the component fails, it can be repaired to the same condition as before the failure occurred. An age/block replacement model with failures resulting from shocks is described in [38]. The shocks follow a non-homogeneous Poisson process (a Poisson process with a rate that is not stationary). Two types of failures can result from the shocks: minor failures removed by minor repair, and major failures removed by replacement.
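To make the age replacement trade-off concrete, the following sketch (illustrative code, not taken from the thesis or the cited references) evaluates the classical long-run cost rate of an age replacement policy and searches numerically for the best replacement age; the Weibull lifetime parameters and the cost figures are assumptions chosen only for the example.

```python
import numpy as np

# Assumed data: Weibull lifetime with increasing failure rate, and replacement costs
beta, eta = 2.5, 10.0        # shape and scale (years), hypothetical values
c_p, c_f = 1.0, 5.0          # preventive vs. corrective replacement cost, hypothetical

def survival(t):
    """Probability that the component survives beyond age t."""
    return np.exp(-(t / eta) ** beta)

def cost_rate(T, n_grid=2000):
    """Long-run cost per unit time when replacing at age T or at failure."""
    t = np.linspace(0.0, T, n_grid)
    expected_cycle_length = np.sum(survival(t)) * (T / n_grid)   # E[min(lifetime, T)]
    expected_cycle_cost = c_p * survival(T) + c_f * (1.0 - survival(T))
    return expected_cycle_cost / expected_cycle_length

ages = np.linspace(0.5, 20.0, 400)
best_age = ages[int(np.argmin([cost_rate(T) for T in ages]))]
print(f"Approximately optimal replacement age: {best_age:.2f} years")
```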

2.2.2

Block Replacement Policies

In block replacement policies, the components of a system are replaced at failure or at fixed times kT (k = 1, 2, ...), whichever occurs first. Barlow et al. [7] describe a basic block replacement model. To avoid that a component that has just been replaced is replaced again, a modified block replacement model is proposed in [10]: a component is not replaced at a scheduled replacement time if its age is less than T. This model has been modified in [11] to reflect that the operational cost of a unit increases as it becomes older. Moreover, the model of [10] is extended in [5] to allow multi-component systems with any discrete lifetime distribution.

2.2.3

Condition Based Maintenance

CBM is being introduced in many systems to avoid unnecessary maintenance and prevent incipient failures. In wind turbines, condition monitoring is being introduced for components like the gear box, blades, etc. [32]. One problem prior to the optimization is to identify the relevant variables and their relation to failure modes and probabilities. CBM optimization models focus on different questions related to inspected or monitored components. One question is the optimal limits for the monitored variables above which it is necessary to perform maintenance. The optimal wear limit for preventive replacement of a component is derived in [34]. The model is extended in [35] to include different monitoring variables. For components subject to inspection, at each decision epoch one must decide if maintenance should be performed and when the next inspection should occur. In [2], the inspections occur at fixed times and the decision of preventive replacement of the component depends on its condition at inspection. In [9], a Semi-Markov Decision Process (SMDP, see Chapter 6) is proposed to optimize at each inspection the maintenance decision and the time to the next inspection. An age replacement policy model that takes into account the information from condition based monitoring devices is proposed in [25]. A proportional hazard model is used to model the effect of the monitored variables. The assumption of such a model is that the hazard function is the product of two functions, one depending on time and one on the parameters (monitored variables).

2.2.4

Opportunistic Maintenance Models

Opportunistic maintenance considers unexpected opportunities for performing preventive maintenance. With the failure of a component, it is possible to perform PM on other components. This could be interesting for offshore wind farms, for example: transportation to the wind farm, by boat or helicopter, is necessary and can be very expensive, so by grouping maintenance actions money could be saved. Haurie et al. [19] focus on a group preventive replacement policy for m identical components that are in the same condition. Both discrete and continuous time are considered and a dynamic programming equation is derived. The model is extended in [26] to m non-identical components. A rolling horizon dynamic programming algorithm is proposed in [45] to take into account short term information. The model can be used for many maintenance optimization models.

2.2.5

Other Types of Models and Criteria of Classifications

Other models integrate the possibility of a limited number of spare parts or a possible choice between different spare parts. For example, cannibalization models allow the re-use of some components or subcomponents of a system. Other criteria can be used to classify maintenance optimization models. The number of components in consideration is important; e.g. multi-component models are more relevant for power systems. The time horizon considered in the model is also important. Many articles consider an infinite time horizon. More focus should be put on finite horizons since they are more practical. Another characteristic of the model is the time representation, i.e. whether discrete or continuous time is considered. One distinction can be made between models with deterministic and stochastic lifetimes of components. Among stochastic approaches, it can be interesting to consider which kinds of lifetime distributions can be used. The method used for solving the problem has an influence on the solution. A model that can not be solved is of no interest. For some models, exact solutions are possible. For complex models, it is either necessary to simplify the model or to use heuristic methods to find approximate solutions.


Chapter 3

Introduction to the Power System
This chapter gives a brief description of electrical power systems. Some costs and constraints for a maintenance model are proposed.

3.1

Power System Presentation

Power systems are very complex. They are composed of thousands of components linked through a complex mesh of lines and cables that have limited capacities. With the deregulation of power systems, the generation, distribution and transmission systems are separated. Even considered independently, each part of the power system is complex with many components and subcomponents.

3.1.1

Power System Description

A simple description of the power system includes the following main parts:

1. Generation: These are the generation units that produce the power, e.g. hydro-power units, nuclear power plants, wind farms, etc. The total power consumed is always equal to the power generated.

2. Transmission: The transmission system is composed of high voltage and high power lines. This part of the system is in general meshed. The transmission system connects distribution systems with generation units.

3. Distribution: The distribution system is at a voltage level below transmission and is connected to customers. It connects the transmission system with the consumers. Distribution systems are in general operated radially (one connection point to the transmission system).

4. Consumption: The consumers can be divided into different categories: industry, commercial, households, offices, agriculture, etc. The costs for interruption are in general different for the different categories of consumers. These costs also depend on the time of outage.

The trade of electricity between producers and consumers is made through different specific markets in the world. The rules and organization are different for each market place. The bids of electricity trades are declared in advance to the system operator. This is necessary to check that the power system can withstand the operational conditions. The power system is controlled in real-time, both automatically (automatic control and protection devices) and manually (with the help of the system operator to coordinate the necessary actions to avoid dangerous situations). Each component of the system influences the others. If a component has a functional failure, it can induce failures of other components. Cascading failures can have drastic consequences such as black-outs.

3.1.2

Maintenance in Power System

The objective is to find the right way to do maintenance. Corrective Maintenance and Preventive Maintenance should be balanced for each component of a system, and the optimal PM approaches should be determined.

Reliability Centered Maintenance (RCM) is being introduced in power companies (see [47] for an example in hydropower). RCM is a structured approach to find a balance between corrective and preventive maintenance. Research on Reliability Centered Asset Maintenance (RCAM), a quantitative approach to RCM, is being carried out in the RCAM group at KTH School of Electrical Engineering. Bertling et al. [12] define the approach and its different steps in detail. An important step is the maintenance optimization. In Hilber et al. [20], a method based on a monetary importance index is proposed to define the importance of individual components in a network. Ongoing research focuses for example on wind power (see [39], [32]).

Research about power generation typically focuses on predictive maintenance using condition based monitoring systems (see for example [18] or [44]). The problem of maintenance for transmission and distribution systems has received more attention since the deregulation of the electricity market (see for example [12], [27] for distribution systems and [22], [30] for transmission systems). The emergence of new condition based monitoring systems is changing the approach to maintenance in power systems. There is a need for new models and methods to optimize the use of condition based monitoring systems.

3.2

Costs

Possible costs/incomes related to maintenance in power systems have been identified (non-exhaustively) as follows:

• Manpower cost: Cost for the maintenance team that performs the maintenance actions.

• Spare part cost: The cost of a new component is an important part of the maintenance cost.

• Maintenance equipment cost: Special equipment may be needed for undertaking the maintenance. A helicopter can sometimes be necessary for the maintenance of some parts of an offshore wind turbine.

• Energy production: The electricity produced is sold to consumers on the electricity market. The price of electricity can fluctuate. At the same time, the power produced by a generating unit can fluctuate depending on factors like the weather (for renewable energy). The condition of the unit can also influence its efficiency.

• Unserved energy/Interruption cost: If there is an agreement to produce/deliver energy to a consumer at some specific time, unserved energy must be paid for. The cost depends on the contract, and the cost per unit time depends on the duration of the failure.

• Inspection/Monitoring cost: Inspection or monitoring systems have a cost that must be considered. The cost can be an initial investment (for continuous monitoring systems) or discrete costs (each time an inspection, measurement or test is done on an asset).

3.3

Main Constraints

Possible constraints for the maintenance of power systems have been identified as follows:

• Manpower: The size and availability of the maintenance staff is limited.

• Maintenance Equipment: The equipment needed for undertaking the maintenance must be available.

• Weather: The weather can force certain maintenance actions to be postponed; e.g. in very windy conditions it is not possible to carry out maintenance on offshore wind farms.

• Availability of the Spare Parts: If the needed spare parts are not available, maintenance can not be done. It can also happen that a spare part is available but far away from the location where it is needed. The transportation has a cost and takes time.

• Maintenance Contracts: Power companies can subscribe to maintenance services from the manufacturer of a system. This is a typical option for wind turbines [33]. The time span of a contract can be a constraint for an optimization model.

• Availability of Condition Monitoring Information: If condition monitoring systems are installed on a system, the information gathered by the monitoring devices is not always available to non-manufacturer companies. The availability of monitoring information has an important impact on the possible inputs for an optimization model.

• Statistical Data: Available monitoring information has value only if conclusions about the deterioration or failure state of a system can be drawn from it. Statistical data are necessary to create a probabilistic model.


Chapter 4

Introduction to Dynamic Programming
This chapter deals with general ideas about Dynamic Programming (DP) and some features of possible DP models. Deterministic DP is used to introduce the basics of DP formulation and the value iteration method, a classical method for solving DP models.

4.1

Introduction

Dynamic Programming deals with multi-stage or sequential decision problems. At each decision epoch, the decision maker (also called agent or controller in different contexts) observes the state of a system. (It is assumed in this thesis that the system is perfectly observable.) An action is decided based on this state. This action will result in an immediate cost (or reward) and influence the evolution of the system. The aim of DP is to minimize (or maximize) the cumulative cost (respectively income) resulting from a sequence of decisions. In the following, important ideas concerning Dynamic Programming are discussed.

4.1.1

Principle of Optimality

Dynamic programming is a way of decomposing a large problem into subproblems. It can be applied to any problem that satisfies the principle of optimality:

An optimal policy has the property that, whatever the initial state and optimal first decision may be, the remaining decisions constitute an optimal policy with regard to the state resulting from the first decision. [8]

The solutions of the subproblems are themselves part of the solution of the general problem. The principle implies that at each stage the decisions are based only on the current state of the system. The previous decisions should have no influence on the current evolution of the system and the possible actions. Basically, in maintenance problems, it would mean that maintenance actions have an effect only on the state of the system directly after their completion. They do not influence the deterioration process after they have been completed.

4.1.2

Deterministic and Stochastic Models

A system is said to be deterministic if the state at the next epoch depends only on the current state and the action taken. If a system is subject to probabilistic events, it will evolve according to a probability distribution depending on the current state and the chosen action. The system is then referred to as probabilistic or stochastic. Functional failures are in general represented as stochastic events. In consequence, stochastic maintenance optimization models are interesting.

4.1.3

Time Horizon

The time horizon of a model is the time "window" considered for the optimization. One distinguishes between finite and infinite time horizons. Chapter 5 focuses on finite horizon stochastic dynamic programming. In the context of maintenance, the objective would be, for example, to minimize the maintenance costs during the time horizon considered. Chapters 6 and 7 focus on models that assume an infinite time horizon. This assumption implies that the system is stationary, i.e. that it evolves in the same manner all the time. Moreover, an infinite horizon optimization implicitly assumes that the system is used for an infinite time. It can be a good approximation if the lifetime of a system is indeed very long.

4.1.4

Decision Time

In this thesis, we focus mainly on Stochastic Dynamic Programming (SDP) with discrete sets of decision epochs (Chapters 4, 5 and 6). Decisions are made at each decision epoch. The time is divided into stages or periods between these epochs. It is clear that the time interval between two stages will have an influence on the result. Short intervals are more realistic and precise, but the models can become heavy if the time horizon is large. In practice, long intervals can be used for long-term planning while short-term planning considers shorter intervals. A continuous set of decision epochs implies that the decisions can be made either continuously, at some points decided by the decision maker, or when an event occurs. The last two possibilities are briefly investigated in Chapter 6. Continuous decisions refer to optimal control theory and will not be discussed here.

4.1.5

Exact and Approximation Methods

Dynamic Programming suffers from a complexity problem, the curse of dimensionality (discussed in Section 5.4). Methods for solving dynamic programming models exactly exist and are presented in Chapters 5 and 6. However, large models are intractable with these methods. Chapter 7 provides an introduction to the field of Reinforcement Learning (RL), which focuses on approximations of DP solutions. Approximate algorithms are obtained by combining DP and supervised learning algorithms. RL is also known as neuro-dynamic programming when DP is combined with neural networks [13].


4.2

Deterministic Dynamic Programming

This section introduces the basics of deterministic Dynamic Programming. The optimality equation is presented with the value iteration algorithm to solve it. The section is illustrated with a classical example of a simple shortest path problem.

4.2.1

Problem Formulation

The three main parts of a DP model are its state and decision spaces, its dynamic and cost functions, and its objective function. The finite horizon model considers a system that evolves for N stages.

State and Decision Spaces
At each stage k, the system is in a state X_k = i that belongs to a state space Ω^X_k. Depending on the state of the system, the decision maker decides on an action, u = U_k ∈ Ω^U_k(i).

Dynamic and Cost Functions
As a result of this action, the system state at the next stage will be X_{k+1} = f_k(i, u). Moreover, the action has a cost that the decision maker has to pay, C_k(i, u). A possible terminal cost C_N(X_N) is associated with the terminal state (the state at stage N).

Objective Function
The objective is to determine the sequence of decisions that minimizes the cumulative cost (also called cost-to-go function) subject to the dynamics of the system:
J*_0(X_0) = min_{U_k} Σ_{k=0}^{N−1} C_k(X_k, U_k) + C_N(X_N)

subject to X_{k+1} = f_k(X_k, U_k),  k = 0, ..., N − 1

N: Number of stages
k: Stage
i: State at the current stage
j: State at the next stage
X_k: State at stage k
U_k: Decision action at stage k
C_k(i, u): Cost function
C_N(i): Terminal cost for state i
f_k(i, u): Dynamic function
J*_0(i): Optimal cost-to-go starting from state i

4.2.2

The Optimality Equation and Value Iteration Algorithm

The optimality equation (also known as Bellman's equation) derives directly from the principle of optimality. It states that the optimal cost-to-go function starting from stage k can be derived with the following formula:

J*_k(i) = min_{u∈Ω^U_k(i)} {C_k(i, u) + J*_{k+1}(f_k(i, u))}     (4.1)

J*_k(i): Optimal cost-to-go from stage k to N starting from state i

The value iteration algorithm is a direct consequence of the optimality equation:

J*_N(i) = C_N(i)   ∀i ∈ Ω^X_N
J*_k(i) = min_{u∈Ω^U_k(i)} {C_k(i, u) + J*_{k+1}(f_k(i, u))}   ∀i ∈ Ω^X_k
U*_k(i) = argmin_{u∈Ω^U_k(i)} {C_k(i, u) + J*_{k+1}(f_k(i, u))}   ∀i ∈ Ω^X_k

u: Decision variable
U*_k(i): Optimal decision action at stage k for state i

The algorithm goes backwards, starting from the last stage. It stops when k=0.
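The backward recursion can be sketched generically in code. The following is an illustrative implementation (not from the thesis); the stage-indexed state spaces, decision spaces, dynamic function f and cost functions are assumed to be supplied by the user.

```python
def value_iteration(N, states, controls, f, C, C_terminal):
    """Backward value iteration for a deterministic finite horizon DP.

    states[k]      : list of admissible states at stage k (k = 0, ..., N)
    controls[k][i] : list of admissible decisions for state i at stage k
    f(k, i, u)     : next state reached from state i at stage k under decision u
    C(k, i, u)     : stage cost of decision u in state i at stage k
    C_terminal(i)  : terminal cost of ending in state i at stage N
    """
    J = {N: {i: C_terminal(i) for i in states[N]}}   # J*_N(i) = C_N(i)
    policy = {}
    for k in range(N - 1, -1, -1):                   # backwards from stage N-1 to 0
        J[k], policy[k] = {}, {}
        for i in states[k]:
            costs = {u: C(k, i, u) + J[k + 1][f(k, i, u)] for u in controls[k][i]}
            best_u = min(costs, key=costs.get)       # argmin over admissible decisions
            J[k][i], policy[k][i] = costs[best_u], best_u
    return J, policy
```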


4.2.3

A Simple Shortest Path Problem Example

Deterministic dynamic programming can be used to solve simple shortest path problems with a small state space. An example is used to illustrate the formulation and the value iteration algorithm. The following shortest path problem is considered:

(Graph of the shortest path problem: node A at stage 0; nodes B, C, D at stage 1; nodes E, F, G at stage 2; nodes H, I, J at stage 3; node K at stage 4. Each arc between consecutive stages carries a cost corresponding to a distance.)

The aim of the problem is to determine the shortest way to reach node K starting from node A. A cost (corresponding to a distance) is associated with each arc. A first way to solve the problem would be to calculate the cost of all possible paths. For example, the path A-B-F-J-K has a cost of 2+6+2+7=17. The shortest path would then be the one with the lowest cost. Dynamic programming provides a more efficient way to solve the problem. Instead of calculating all the path costs, the problem is divided into subproblems that are solved recursively to determine the shortest path from each possible node to the terminal node K.

4.2.3.1

Problem Formulation

The problem is divided into five stages, k = {0, 1, 2, 3, 4}.

State Space
The state space is defined for each stage:
Ω^X_0 = {A} = {0}, Ω^X_1 = {B, C, D} = {0, 1, 2}, Ω^X_2 = {E, F, G} = {0, 1, 2}, Ω^X_3 = {H, I, J} = {0, 1, 2}, Ω^X_4 = {K} = {0}.
Each node of the problem is defined by a state X_k. For example, X_2 = 1 corresponds to the node F. In this problem, the state space is defined by one variable. It is also possible to have a multi-variable state space, for which X_k would be a vector.

Decision Space
The set of possible decisions must be defined for each state at each stage. In the example, the choice is "which way should I take from this node to go to the next stage?". The following notation is used:

Ω^U_0(0) = {0, 1, 2} for k = 0
Ω^U_k(i) = {0, 1} for i = 0, {0, 1, 2} for i = 1, {1, 2} for i = 2, for k = 1, 2, 3

For example, Ω^U_1(0) = Ω^U_1(B) = {0, 1}, with U_1(0) = 0 for the transition B ⇒ E or U_1(0) = 1 for the transition B ⇒ F. As another example, Ω^U_1(2) = Ω^U_1(D) = {1, 2}, with u_1(2) = 1 for the transition D ⇒ F or u_1(2) = 2 for the transition D ⇒ G.

A sequence π = {µ_0, µ_1, ..., µ_N}, where µ_k(i) is a function mapping the state i at stage k to an admissible control for this state, is called a policy. The value iteration algorithm determines the optimal policy of the problem, π* = {µ*_0, µ*_1, ..., µ*_N}.

Dynamic and Cost Functions
The dynamic function of the example is simple thanks to the notation used: f_k(i, u) = u. The transition costs are defined as the distance from one state to the resulting state of the decision. For example, C_1(0, 0) = C(B ⇒ E) = 4. The cost function is defined in the same way for the other stages and states.

Objective Function
J*_0(0) = min_{U_k∈Ω^U_k(X_k)} Σ_{k=0}^{3} C_k(X_k, U_k) + C_4(X_4)

subject to X_{k+1} = f_k(X_k, U_k),  k = 0, ..., 3

4.2.3.2

Solution

The value iteration algorithm is used to solve the problem. The algorithm is initiated at the last stage and then iterated backwards until the initial state is reached. The optimal decision sequence is then obtained forwards by using the optimal decisions determined by the DP algorithm for the sequence of states that are visited. The solution of the algorithm is given in Appendix A.

The optimal cost-to-go is J*_0(0) = 8. It corresponds to the path A ⇒ D ⇒ G ⇒ I ⇒ K. The optimal policy of the problem is π* = {µ*_0, µ*_1, µ*_2, µ*_3, µ*_4} with µ*_k(i) = U*_k(i) (for example µ*_1(1) = 2, µ*_1(2) = 2).


Chapter 5

Finite Horizon Models
In this chapter, a stochastic version of the dynamic programming model of Chapter 4 is presented. The chapter introduces the theory for the model proposed in Chapter 9. For more details and examples, the book Markov Decision Processes: Discrete Stochastic Dynamic Programming [36] is recommended.

5.1

Problem Formulation

Stochastic dynamic programming can be used to model systems whose dynamics are probabilistic (or subject to disturbances). The state of the system at the next stage is not deterministic as in Chapter 4. It depends on the current state and decision, but also on a stochastic variable that describes the disturbance, i.e. the stochastic behavior of the system. A stochastic dynamic programming model can be formulated as below:

State Space
A variable k ∈ {0, ..., N} represents the different stages of the problem. In general it corresponds to a time variable. The state of the system is characterized by a variable i = X_k. The possible states are represented by a set of admissible states that can depend on k, X_k ∈ Ω^X_k.

Decision Space
At each decision epoch, the decision maker must choose an action u = U_k among a set of admissible actions. This set can depend on the state of the system and on the stage, u ∈ Ω^U_k(i).

Dynamic of the System and Transition Probabilities
Contrary to the deterministic case, the state transition does not depend only on the control used but also on a disturbance ω = ω_k(i, u):

X_{k+1} = f_k(X_k, U_k, ω),  k = 0, 1, ..., N − 1

The effect of the disturbance can be expressed with transition probabilities. The transition probabilities define the probability that the state of the system at stage k+1 is j if the state and control at stage k are i and u. These probabilities can also depend on the stage:

P_k(j, u, i) = P(X_{k+1} = j | X_k = i, U_k = u)

If the system is stationary (time-invariant), the dynamic function f does not depend on time and the notation for the probability function can be simplified:

P(j, u, i) = P(X_{k+1} = j | X_k = i, U_k = u)

In this case, one refers to a Markov decision process. If a control u is fixed for each possible state of the model, then the transition probabilities can be represented by a Markov model (see Chapter 9 for an example).

Cost Function
A cost is associated with each possible transition (i, j) and action u. The costs can also depend on the stage:

C_k(j, u, i) = C_k(X_{k+1} = j, U_k = u, X_k = i)

If the transition (i, j) occurs at stage k when the decision is u, then the cost C_k(j, u, i) is incurred. If the cost function is stationary, then the notation is simplified to C(i, u, j). A terminal cost C_N(i) can be used to penalize deviation from a desired terminal state.

Objective Function
The objective is to determine the sequence of decisions that optimizes the expected cumulative cost (cost-to-go function) J*(X_0), where X_0 is the initial state of the system:

J*(X_0) = min_{U_k∈Ω^U_k(X_k)} E{C_N(X_N) + Σ_{k=0}^{N−1} C_k(X_{k+1}, U_k, X_k)}

subject to X_{k+1} = f_k(X_k, U_k, ω_k(X_k, U_k)),  k = 0, 1, ..., N − 1

N: Number of stages
k: Stage
i: State at the current stage
j: State at the next stage
X_k: State at stage k
U_k: Decision action at stage k
ω_k(i, u): Probabilistic function of the disturbance
C_k(i, u, j): Cost function
C_N(i): Terminal cost for state i
f_k(i, u, ω): Dynamic function
J*_0(i): Optimal cost-to-go starting from state i

5.2

Optimality Equation

The optimality equation for stochastic finite horizon DP is:

J*_k(i) = min_{u∈Ω^U_k(i)} E{C_k(i, u) + J*_{k+1}(f_k(i, u, ω))}     (5.1)

This equation defines a condition for the cost-to-go function of a state i at stage k to be optimal. The equation can be re-written using the transition probabilities:

J*_k(i) = min_{u∈Ω^U_k(i)} Σ_{j∈Ω^X_{k+1}} P_k(j, u, i) · [C_k(i, u, j) + J*_{k+1}(j)]     (5.2)

Ω^X_k: State space at stage k
Ω^U_k(i): Decision space at stage k for state i
P_k(j, u, i): Transition probability function

5.3

Value Iteration Method

The Value Iteration (VI) algorithm for SDP problems is directly based on equation (5.2). The algorithm starts from the last stage. By backward recursion, it determines at each stage the optimal decision for each state of the system.
J*_N(i) = C_N(i)   ∀i ∈ Ω^X_N   (initialisation)

While k ≥ 0 do:
  J*_k(i) = min_{u∈Ω^U_k(i)} Σ_{j∈Ω^X_{k+1}} P_k(j, u, i) · [C_k(i, u, j) + J*_{k+1}(j)]   ∀i ∈ Ω^X_k
  U*_k(i) = argmin_{u∈Ω^U_k(i)} Σ_{j∈Ω^X_{k+1}} P_k(j, u, i) · [C_k(i, u, j) + J*_{k+1}(j)]   ∀i ∈ Ω^X_k
  k ← k − 1

u: Decision variable
U*_k(i): Optimal decision action at stage k for state i

The recursion finishes when the first stage is reached.
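A minimal sketch of this backward recursion in code (illustrative, not from the thesis), assuming the transition probabilities and costs are supplied as NumPy arrays indexed by stage, action and state pair:

```python
import numpy as np

def stochastic_value_iteration(P, C, C_terminal):
    """Backward value iteration for a finite horizon stochastic DP.

    P[k][u] : (n x n) matrix with P[k][u][i, j] = P_k(j, u, i)
    C[k][u] : (n x n) matrix with C[k][u][i, j] = C_k(i, u, j)
    C_terminal : length-n vector of terminal costs C_N(i)
    """
    N = len(P)                                   # number of stages
    J = np.asarray(C_terminal, dtype=float)      # J*_N = C_N
    policy = [None] * N
    for k in reversed(range(N)):
        # Expected cost of each action u in each state i: sum_j P * (C + J(j))
        Q = np.stack([np.sum(P[k][u] * (C[k][u] + J), axis=1)
                      for u in range(len(P[k]))], axis=1)
        policy[k] = Q.argmin(axis=1)             # U*_k(i)
        J = Q.min(axis=1)                        # J*_k(i)
    return J, policy                             # J is the optimal cost-to-go at stage 0
```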

5.4

The Curse of Dimensionality

Consider a finite horizon stochastic dynamic problem with:

• N stages;
• N_X state variables, where the size of the set for each state variable is S;
• N_U control variables, where the size of the set for each control variable is A.

The time complexity of the algorithm is O(N · S^{2·N_X} · A^{N_U}). The complexity of the problem increases exponentially with the size of the problem (number of state or decision variables). This characteristic of SDP is called the curse of dimensionality.
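As an illustration with assumed sizes (not figures from the thesis), even a modest model becomes large quickly:

```python
# Hypothetical sizes: 52 weekly stages, 3 state variables with 10 levels each,
# 2 control variables with 3 actions each
N, S, N_X, A, N_U = 52, 10, 3, 3, 2
operations = N * S ** (2 * N_X) * A ** N_U
print(operations)   # 52 * 10**6 * 9 = 468_000_000 elementary operations
```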

5.5

Ideas for a Maintenance Optimization Model

In this section, possible state variables for maintenance models based on SDP are discussed.

5.5.1

Age and Deterioration States

The failure probability of a component is often modelled as a function of time. A possible state variable for the component is therefore its age. To be precise, the age of the component should be discretized according to the stage duration. If the lifetime of a component is very long, this can lead to a very large state space. The time horizon can be considered to reduce the number of states: if a state variable can not reach certain states during the planned horizon, these states can be neglected. If a component, subcomponent or part of a system can be inspected or monitored, different levels of deterioration can be used as a state variable. In practice, age and deterioration state variables could be used in a complementary way. Of course, maintenance states should be considered in both cases. It would also be possible to have different types of failure states, such as major failures and minor failures. Minor failures could be cleared by repair, while for a major failure the component should be replaced.
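As a hypothetical illustration of such a state description (the numbers below are assumptions chosen for the example, not data from the thesis), a component could be described by three deterioration levels plus a failed state, a "do nothing" action under which it degrades stochastically, and a "replace" action that returns it to the new state:

```python
import numpy as np

# States: 0 = new, 1 = minor wear, 2 = major wear, 3 = failed (assumed discretization)
P_do_nothing = np.array([
    [0.90, 0.08, 0.02, 0.00],
    [0.00, 0.85, 0.10, 0.05],
    [0.00, 0.00, 0.80, 0.20],
    [0.00, 0.00, 0.00, 1.00],   # a failed component stays failed until replaced
])
P_replace = np.tile([1.0, 0.0, 0.0, 0.0], (4, 1))    # replacement returns the unit to 'new'

# Assumed per-stage costs: doing nothing is free unless the component is failed
cost_do_nothing = np.array([0.0, 0.0, 0.0, 50.0])    # downtime cost in the failed state
cost_replace = np.array([10.0, 10.0, 10.0, 60.0])    # corrective replacement is dearer
```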

5.5.2

Forecasts

Measurements or forecasts can sometimes estimate the disturbances a system is, or can be, subject to. The reliability of the forecasts should be carefully considered. Deterministic information could be used to adapt the finite horizon model to its horizon of validity. It would also be possible to generate different scenarios from forecasts, solve the problem for the different scenarios and draw conclusions from the different solutions. Another way of using forecasting models is to include them in the maintenance problem formulation by adding a specific variable. This will reduce the uncertainties but in return increase the complexity. The model proposed in Chapter 9 gives an example of how to integrate a forecasting model in an electricity scenario. Another factor that could be interesting to forecast is the load. Indeed, the production must always be in balance with the consumption, and if there is no consumption, some generation units are stopped. This time can be used for the maintenance of the power plant. Weather forecasting could also be interesting in some cases. For example, the power generated by wind farms depends on the wind strength, and maintenance actions on offshore wind farms are possible only in case of good weather. For these two reasons, wind forecasting could be interesting for optimizing maintenance actions at offshore wind farms.

5.5.3

Time Lags

An important assumption of a DP model is that the dynamics of the system depend only on the current state of the system (and possibly on the time if the system dynamics are not stationary). This loss-of-memory condition is very strong and unrealistic in some cases. It is sometimes possible (if the system dynamics depend on a few preceding states) to overcome this assumption: variables are added to the DP model to keep the preceding states in memory. The computational price is once again very high. For example, in the context of maintenance, it would be interesting to know the deterioration level of an asset at the preceding stage. It would give information about the dynamics of the deterioration process.


Chapter 6

Infinite Horizon Models - Markov Decision Processes
Infinite horizon models are models of systems that are considered stationary over time. The dynamics of the system, as well as the cost function and the disturbances, are stationary. Infinite horizon stochastic dynamic programming (IHSDP) models can be represented by a Markov Decision Process. For more details and proofs of the convergence of the algorithms, [36] or the introductory chapter of [13] are recommended. In practice, one scarcely faces problems with an infinite number of stages. It can however be a reasonable approximation of problems with a very large number of stages, for which the value iteration algorithm would lead to intractable computations. The approximate methods presented in Chapter 7 are based on the methods presented in this chapter.

6.1

Problem Formulation

The state space, decision space, probability function and cost function of IHSDP are defined in a similar way as for FHSDP, for the stationary case. The aim of IHSDP is to minimize the cumulative costs of a system over an infinite number of stages. This sum is called the cost-to-go function. An interesting feature of IHSDP models is that the solution of the problem is a stationary policy. This means that the solution has the form π = {µ, µ, µ, ...}, where µ is a function mapping the state space to the control space: for i ∈ Ω^X, µ(i) is an admissible control for the state i, µ(i) ∈ Ω^U(i). The objective is to find the optimal µ*, which minimizes the cost-to-go function. To be able to compare different policies, it is necessary that the infinite sum of costs converges. Different types of models can be considered: stochastic shortest path problems, discounted problems and average cost per stage problems.

Stochastic shortest path models
Stochastic shortest path dynamic programming models have a terminal state (or cost-free termination state) that cannot be avoided. When this state is reached, the system remains in it and no further costs are paid:

J*(X_0) = min_µ E{lim_{N→∞} Σ_{k=0}^{N−1} C(X_{k+1}, µ(X_k), X_k)}

subject to X_{k+1} = f(X_k, µ(X_k), ω(X_k, µ(X_k))),  k = 0, 1, ..., N − 1

µ: Decision policy
J*(i): Optimal cost-to-go function for state i

Discounted problems
Discounted IHSDP models have a cost function that is discounted by a discount factor α (0 < α < 1); the cost incurred at stage k has the form α^k · C_ij(u). Since C_ij(u) is bounded, the infinite sum converges (it is dominated by a decreasing geometric progression).

J*(X_0) = min_µ E{lim_{N→∞} Σ_{k=0}^{N−1} α^k · C(X_{k+1}, µ(X_k), X_k)}

subject to X_{k+1} = f(X_k, µ(X_k), ω(X_k, µ(X_k))),  k = 0, 1, ..., N − 1

α: Discount factor

Average cost per stage problems
Infinite horizon problems can sometimes not be represented with a cost-free termination state or with discounting. To make the cost-to-go finite, the problem can be modelled as an average cost per stage problem, where the aim is to minimize:

J* = min_µ E{lim_{N→∞} (1/N) Σ_{k=0}^{N−1} C(X_{k+1}, µ(X_k), X_k)}

subject to X_{k+1} = f(X_k, µ(X_k), ω(X_k, µ(X_k))),  k = 0, 1, ..., N − 1

6.2

Optimality Equations

The optimality equations are formulated using the probability function P(j, u, i). The optimal cost-to-go function of an IHSDP shortest path problem, and the stationary policy µ* that attains it, are solutions of Bellman's equation (another name for the optimality equation; Bellman is the mathematician at the origin of the DP theory):

J*(i) = min_{u∈Ω^U(i)} Σ_{j∈Ω^X} P_ij(u) · [C_ij(u) + J*(j)]   ∀i ∈ Ω^X

J_µ(i): Cost-to-go function of policy µ starting from state i
J*(i): Optimal cost-to-go function for state i

For an IHSDP discounted problem, the optimality equation is:

J*(i) = min_{u∈Ω^U(i)} Σ_{j∈Ω^X} P_ij(u) · [C_ij(u) + α · J*(j)]   ∀i ∈ Ω^X

The optimality equation for average cost-to-go IHSDP problems is discussed in Section 6.6.

6.3

Value Iteration

To solve the optimality equations, a first idea would be to use the value iteration algorithm presented in Chapter 5. Intuitively, the algorithm should converge to the optimal policy, and it can indeed be shown that the algorithm converges to the optimal solution. If the model is discounted, the method can be fast: the time complexity is polynomial in the size of the state space, the size of the control space and 1/(1−α). For non-discounted models, the theoretical number of iterations needed is infinite and a stopping criterion must be determined to stop the algorithm. An alternative to this method is the Policy Iteration (PI) algorithm, which terminates after a finite number of iterations.
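The following sketch shows value iteration for the discounted case with a simple stopping tolerance (an illustrative implementation under assumed stationary arrays P[u] and C[u], not code from the thesis):

```python
import numpy as np

def discounted_value_iteration(P, C, alpha, tol=1e-8, max_iter=100_000):
    """Value iteration for a discounted infinite horizon MDP (cost minimization).

    P[u] : (n x n) stationary transition matrix for action u, P[u][i, j] = P(j, u, i)
    C[u] : (n x n) transition costs for action u, C[u][i, j] = C(i, u, j)
    """
    n = P[0].shape[0]
    J = np.zeros(n)
    for _ in range(max_iter):
        # Bellman backup: expected cost of each action in each state
        Q = np.stack([np.sum(P[u] * (C[u] + alpha * J), axis=1)
                      for u in range(len(P))], axis=1)
        J_new = Q.min(axis=1)
        if np.max(np.abs(J_new - J)) < tol:
            J = J_new
            break
        J = J_new
    return J, Q.argmin(axis=1)      # cost-to-go and greedy (near-optimal) policy
```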

6.4

The Policy Iteration Algorithm

Given a policy µ, the first step of the algorithm evaluates the policy by calculating the expected cost-to-go function resulting from this policy. The next step of the algorithm improves the expected cost-to-go function by enhancing the current policy. This two-step procedure is used iteratively. The process stops when a policy is a solution of its own improvement.

The algorithm starts with an initial policy µ_0. It can then be described by the following steps:

Step 1. Policy Evaluation
If µ_{q+1} = µ_q, stop the algorithm. Else, J_{µ_q}(i) is calculated as the solution of the following linear system:

J_{µ_q}(i) = Σ_{j∈Ω^X} P(j, µ_q(i), i) · [C(j, µ_q(i), i) + J_{µ_q}(j)]

q: Iteration number for the policy iteration algorithm

This is the expected cost-to-go function of the system using the policy µ_q.

Step 2. Policy Improvement
A new policy is obtained using one step of the value iteration algorithm:

µ_{q+1}(i) = argmin_{u∈Ω^U(i)} Σ_{j∈Ω^X} P(j, u, i) · [C(j, u, i) + J_{µ_q}(j)]

Go back to the policy evaluation step. The process stops when µ_{q+1} = µ_q. At each iteration, the algorithm improves the policy. If the initial policy µ_0 is already good, the algorithm will converge quickly to the optimal solution.
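A compact sketch of the two steps for the discounted case (the discount factor keeps the evaluation system non-singular); this is illustrative code under the same assumed P[u] and C[u] arrays as above, not the thesis's implementation:

```python
import numpy as np

def policy_iteration(P, C, alpha=0.95, max_iter=1000):
    """Policy iteration for a discounted stationary MDP (cost minimization)."""
    n_actions, n = len(P), P[0].shape[0]
    policy = np.zeros(n, dtype=int)                       # arbitrary initial policy mu_0
    for _ in range(max_iter):
        # Step 1: policy evaluation, solve (I - alpha * P_mu) J = c_mu
        P_mu = np.array([P[policy[i]][i] for i in range(n)])
        c_mu = np.array([np.sum(P[policy[i]][i] * C[policy[i]][i]) for i in range(n)])
        J = np.linalg.solve(np.eye(n) - alpha * P_mu, c_mu)
        # Step 2: policy improvement (one value iteration step)
        Q = np.stack([np.sum(P[u] * (C[u] + alpha * J), axis=1)
                      for u in range(n_actions)], axis=1)
        new_policy = Q.argmin(axis=1)
        if np.array_equal(new_policy, policy):
            break                                          # policy solves its own improvement
        policy = new_policy
    return J, policy
```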

6.5

Modified Policy Iteration

If the number of states is large, solving the linear system of the policy evaluation step can be computationally intensive. An alternative is to use, at each policy evaluation step, the value iteration algorithm for a finite number of iterations M to estimate the value function of the policy. The algorithm is initialized with a value function J^M_{µ_q}(i) that must be chosen higher than the real value J_{µ_q}(i).

While m ≥ 0 do:
  J^m_{µ_q}(i) = Σ_{j∈Ω^X} P(j, µ_q(i), i) · [C(j, µ_q(i), i) + J^{m+1}_{µ_q}(j)]   ∀i ∈ Ω^X
  m ← m − 1

m: Number of iterations left for the evaluation step of modified policy iteration

The algorithm stops when m = 0, and J_{µ_q} is approximated by J^0_{µ_q}.
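The approximate evaluation step can be sketched as follows, reusing the assumed P, C and alpha arrays from the previous sketches; it simply replaces the linear solve with M backup sweeps under the fixed policy:

```python
import numpy as np

def approximate_policy_evaluation(P, C, policy, alpha, M, J_init):
    """M value-iteration sweeps under a fixed policy (modified policy iteration)."""
    n = P[0].shape[0]
    J = np.array(J_init, dtype=float)      # should over-estimate the true cost-to-go
    for _ in range(M):
        J = np.array([np.sum(P[policy[i]][i] * (C[policy[i]][i] + alpha * J))
                      for i in range(n)])
    return J                               # approximation used in the improvement step
```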

6.6

Average Cost-to-go Problems

The methods presented in the previous sections can not be applied directly to average cost problems. Average cost-to-go problems are more complicated and imply conditions on the Markov decision process for the convergence of the algorithms. An average cost-to-go problem can be reformulated as an equivalent shortest path problem if the Markov decision process is proved to be unichain (that is, all stationary policies generate Markov chains that consist of a single ergodic class and possibly some transient states; see [36] for details). Given a stationary policy µ and a reference state X ∈ Ω^X, there is a unique scalar λ_µ and a vector h_µ such that h_µ(X) = 0 and

λ_µ + h_µ(i) = Σ_{j∈Ω^X} P(j, µ(i), i) · [C(j, µ(i), i) + h_µ(j)]   ∀i ∈ Ω^X

This λ_µ is the average cost-to-go of the stationary policy µ. The average cost-to-go is the same for all starting states. The optimal average cost and the optimal policy satisfy the Bellman equation:

λ* + h*(i) = min_{u∈Ω^U(i)} Σ_{j∈Ω^X} P(j, u, i) · [C(j, u, i) + h*(j)]   ∀i ∈ Ω^X

µ*(i) = argmin_{u∈Ω^U(i)} Σ_{j∈Ω^X} P(j, u, i) · [C(j, u, i) + h*(j)]   ∀i ∈ Ω^X

6.6.1

Relative Value Iteration

The value iteration method can be adapted to average cost-to-go problems. The method is then called relative value iteration. X is an arbitrary reference state and h_0(i) is chosen arbitrarily.

H_k = min_{u∈Ω^U(X)} Σ_{j∈Ω^X} P(j, u, X) · [C(j, u, X) + h_k(j)]

h_{k+1}(i) = min_{u∈Ω^U(i)} Σ_{j∈Ω^X} P(j, u, i) · [C(j, u, i) + h_k(j)] − H_k   ∀i ∈ Ω^X

µ_{k+1}(i) = argmin_{u∈Ω^U(i)} Σ_{j∈Ω^X} P(j, u, i) · [C(j, u, i) + h_k(j)]   ∀i ∈ Ω^X

The sequence h_k will converge if the Markov decision process is unichain, and the algorithm converges to the optimal policy. The number of iterations needed is, in theory, infinite.
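A sketch of relative value iteration under the same assumed P[u] and C[u] arrays (illustrative code, not from the thesis); the lookahead value at the reference state converges to the optimal average cost per stage:

```python
import numpy as np

def relative_value_iteration(P, C, ref_state=0, tol=1e-8, max_iter=100_000):
    """Relative value iteration for a unichain average-cost MDP (cost minimization)."""
    n_actions, n = len(P), P[0].shape[0]
    h = np.zeros(n)
    for _ in range(max_iter):
        # One-step lookahead for every state and action
        Q = np.stack([np.sum(P[u] * (C[u] + h), axis=1) for u in range(n_actions)], axis=1)
        Th = Q.min(axis=1)
        h_new = Th - Th[ref_state]          # subtract the value of the reference state
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    avg_cost = Th[ref_state]                # estimate of the optimal average cost per stage
    return avg_cost, h, Q.argmin(axis=1)
```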

6.6.2

Policy Iteration

The problem can also be solved using the policy iteration algorithm.

Initialisation
The reference state X and the initial policy µ_0 can be chosen arbitrarily.

Step 1. Policy Evaluation
If λ_{q+1} = λ_q and h_{q+1}(i) = h_q(i) ∀i ∈ Ω^X, stop the algorithm. Else solve the system of equations:

h_q(X) = 0
λ_q + h_q(i) = Σ_{j∈Ω^X} P(j, µ_q(i), i) · [C(j, µ_q(i), i) + h_q(j)]   ∀i ∈ Ω^X

Step 2. Policy Improvement
µ_{q+1}(i) = argmin_{u∈Ω^U(i)} Σ_{j∈Ω^X} P(j, u, i) · [C(j, u, i) + h_q(j)]   ∀i ∈ Ω^X

q := q + 1

6.7

Linear Programming

The three types of IHSDP models can be reformulated to be solved with linear programming (LP) methods. The motivation for this approach is that a linear programming model can include constraints that are not possible to include in a classical MDP model. However, the model becomes less intuitive than with the other methods. Moreover, LP can only be used for smaller state spaces than the value iteration and policy iteration methods.

For example, in the discounted IHSDP case, the optimal cost-to-go function satisfies

J*(i) = min_{u∈Ω^U(i)} Σ_{j∈Ω^X} P(j, u, i) · [C(j, u, i) + α · J*(j)]   ∀i ∈ Ω^X

and J*(i) is the solution of the following linear programming model:

Maximize Σ_{i∈Ω^X} J(i)
Subject to J(i) ≤ Σ_{j∈Ω^X} P(j, u, i) · [C(j, u, i) + α · J(j)]   ∀u, i

At present linear programming has not proven to be an efficient method for solving large discounted MDPs; however, innovations in LP algorithms in the past decade might change this [36].
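As an illustration (not from the thesis), the discounted LP above can be assembled and solved with a generic LP solver; the arrays P[u] and C[u] are the same assumed inputs as in the earlier sketches:

```python
import numpy as np
from scipy.optimize import linprog

def solve_discounted_mdp_lp(P, C, alpha):
    """Solve a discounted cost-minimization MDP by linear programming.

    Maximizes sum_i J(i) subject to
        J(i) <= sum_j P[u][i, j] * (C[u][i, j] + alpha * J(j))  for all i, u.
    """
    n_actions, n = len(P), P[0].shape[0]
    c = -np.ones(n)                                      # linprog minimizes, so flip sign
    A_ub, b_ub = [], []
    for u in range(n_actions):
        for i in range(n):
            row = np.zeros(n)
            row[i] += 1.0
            row -= alpha * P[u][i]                       # J(i) - alpha * sum_j P * J(j)
            A_ub.append(row)
            b_ub.append(np.sum(P[u][i] * C[u][i]))       # expected one-step cost
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n, method="highs")
    return res.x                                         # optimal cost-to-go J*(i)
```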

6.8

Efficiency of the Algorithms

For details about the complexity of the algorithms, [28] and [29] are recommended. If n and m denote the number of states and actions, a DP method takes a number of computational operations that is less than some polynomial function of n and m. A DP method is thus guaranteed to find an optimal policy in polynomial time, even though the total number of (deterministic) policies is m^n [41]. Linear programming methods, on the other hand, become impractical at a much smaller number of states than do DP methods [41]. Since the policy iteration algorithm improves the policy at each iteration, the algorithm converges quite fast if the initial policy µ_0 is already good. There is strong empirical evidence in favor of PI over VI and LP in solving Markov decision processes [28].

6.9

Semi-Markov Decision Process

Until now, the decision epochs were predetermined at discrete time points (periodic in the case of infinite horizon problems). However, for some applications the decision times can be random. For example, the next decision time can be decided by the decision maker depending on the current state of the system, or the decision epoch can occur each time the state of the system changes. This kind of problem refers to Semi-Markov Decision Processes (SMDP). SMDP generalize MDP by "1) allowing, or requiring, the decision maker to choose actions whenever the system state changes; 2) modeling the system evolution in continuous time; and 3) allowing the time spent in a particular state to follow an arbitrary probability distribution" [36]. The time horizon is considered infinite and the actions are not made continuously (problems with continuous decisions refer to optimal control theory). SMDP are more complicated than MDP and are not part of this thesis. Puterman [36] explains how one can transform an SMDP model into a model solvable with the methods presented previously in this chapter. SMDP could be interesting in maintenance optimization since they allow a choice of inspection interval for each state of the system. However, due to the complexity of the models, only small state spaces are tractable.

36

Chapter 7

Approximate Methods for Markov Decision Processes: Reinforcement Learning

Reinforcement Learning (RL), or Approximate Dynamic Programming (ADP), is a machine learning approach that combines infinite horizon dynamic programming with supervised learning techniques. Supervised learning techniques give the possibility to approximate the cost-to-go function over a large state space. The aim of this chapter is to give an overview of RL. For further reading, see the books Handbook of Learning and Approximate Dynamic Programming [40] and Neuro-Dynamic Programming [13], and the survey article [23].

7.1 Introduction

The problem with the methods presented in the previous chapter is that the models are intractable for large state spaces. In this chapter, methods to overcome this problem by approximation are presented. They make use of supervised learning techniques. Supervised learning is a field that investigates the creation of functions from training data (input-output pairs), in order to predict the output for any kind of possible input data. Many approaches are possible, such as artificial neural networks, decision tree learning and Bayesian statistics.

One of the first reinforcement learning approaches used artificial neural networks as the supervised learning technique. This approach was also called neuro-dynamic programming (see [13]). Reinforcement learning methods refer to systems that "learn how to make good decisions by observing their own behavior, and use built-in mechanisms for improving their actions through a reinforcement mechanism" [13]. The roots of the algorithms proposed in RL are the methods of Chapter 6. The system is assumed to be stationary and to be a Markov decision process. However, RL does not require that an explicit model of the system exists. The methods can even be applied in parallel with learning the environment (the MDP of the system). This can be a practical advantage, since a tedious model does not need to be built first. The state and decision spaces are assumed known.

The methods work on observed trajectory samples of the form (Xk, Xk+1, Uk, Ck). The samples can be used to learn directly the cost-to-go function of a given policy, or the Q-factors of a problem, without estimating the transition probabilities of the model. The first section deals with this type of learning, called direct learning methods. This approach is useful for large state spaces. If a model of the system exists, the method can be used with samples from Monte Carlo simulations. In the case of a real-time application, it is possible to combine the learning of the transition and cost functions with direct learning methods, to take advantage of all the experience obtained. This approach is called indirect learning (or model-based methods) and is discussed briefly in Section 7.3. The RL methods are extensions of the methods presented in Section 7.2; they make use of supervised learning techniques to approximate the cost-to-go function over the whole state space, and are presented in Section 7.4.

7.2 Direct Learning

The aim of reinforcement learning is to infer good decisions based on samples of the performance of the system, provided by simulation or real-life experience. A sample has the form (Xk, Xk+1, Uk, Ck), where Xk+1 is the observed state after choosing the control Uk in state Xk, and Ck = C(Xk, Xk+1, Uk) is the cost resulting from this transition. The samples can be generated by Monte Carlo simulation according to the transition probabilities P(j, u, i) and costs C(j, u, i), if a model of the system exists.

7.2.1 Policy Evaluation using Temporal Differences

Temporal differences (TD) is a method for estimating the cost-to-go function of a policy µ using samples resulting from the use of this policy. The method is used in the first step of the policy iteration method discussed in Chapter 6, and it can be seen in a similar way as the modified policy iteration. The cost-to-go function is estimated using the costs observed during the simulation. Note that, from each state visited, the remaining trajectory starting from this state can be used as a sample for the cost-to-go function. TD is presented here in the context of stochastic shortest path problems, which means that there is a terminal state and every simulation terminates in finite time. The method can also be adapted to discounted or average cost-to-go problems.

Policy evaluation by simulation
Assume that a trajectory (X_0, ..., X_N) has been generated according to the policy µ and that the sequence of transition costs C(X_k, X_{k+1}) = C(X_k, X_{k+1}, µ(X_k)) has been observed. The cost-to-go resulting from the trajectory starting from state X_k is

V(X_k) = Σ_{n=k}^{N−1} C(X_n, X_{n+1})

If a certain number of trajectories have been generated and state i has been visited K times in these trajectories, J(i) can be estimated by

J(i) = (1/K) Σ_{m=1}^{K} V(i, m)

where V(i, m) denotes the cost-to-go of the trajectory starting from state i after its m-th visit.

A recursive form of the method can be formulated:

J(i) := J(i) + γ · [V(i, m) − J(i)],   with γ = 1/m,

m being the number of the trajectory. From a trajectory point of view,

J(X_k) := J(X_k) + γ_{X_k} · [V(X_k) − J(X_k)]

where γ_{X_k} corresponds to 1/m, m being the number of times X_k has already been visited by trajectories.

With the preceding algorithm, V(X_k) must be calculated from the whole trajectory and can only be used once the trajectory is finished. However, the method can be reformulated by exploiting the relation V(X_k) = C(X_k, X_{k+1}) + V(X_{k+1}). At each transition of the trajectory, the cost-to-go estimates of the states already visited are updated. Assume that the l-th transition has just been generated. Then J(X_k) is updated for all the states that have been visited previously during the trajectory:

J(X_k) := J(X_k) + γ_{X_k} · [C(X_l, X_{l+1}) + J(X_{l+1}) − J(X_l)]   ∀k = 0, ..., l

TD(λ)
A generalization of the preceding algorithm is TD(λ), where a constant λ < 1 is introduced:

J(X_k) := J(X_k) + γ_{X_k} · λ^{l−k} · [C(X_l, X_{l+1}) + J(X_{l+1}) − J(X_l)]   ∀k = 0, ..., l

Note that TD(1) is the same as the policy evaluation by simulation. Another special case is λ = 0. The TD(0) algorithm only updates the last visited state:

J(X_l) := J(X_l) + γ_{X_l} · [C(X_l, X_{l+1}) + J(X_{l+1}) − J(X_l)]

Q-factors
Once J_{µ_k}(i) has been estimated using the TD algorithm, it is possible to make a policy improvement by evaluating the Q-factors defined by

Q_{µ_k}(i, u) = Σ_{j ∈ Ω_X} P(j, u, i) · [C(j, u, i) + J_{µ_k}(j)]

Note that P(j, u, i) and C(j, u, i) must be known for this step. The improved policy is

µ_{k+1}(i) = argmin_{u ∈ Ω_U(i)} Q_{µ_k}(i, u)

This is in fact an approximate version of the policy iteration algorithm, since J_{µ_k} and Q_{µ_k} have been estimated using the samples.
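A minimal sketch of TD(0) policy evaluation is given below, for a hypothetical stochastic shortest path problem with three non-terminal states; the transition probabilities and costs in 'simulate' are assumptions made up for the example. The step size γ = 1/m follows the text; a policy improvement step would then form the Q-factors from the model, as described above.

# A minimal sketch of TD(0) policy evaluation from simulated trajectories
# (hypothetical stochastic shortest path problem; all numbers are assumptions).
import random

TERMINAL = 3
def simulate(i):
    """Sample one transition (next_state, cost) under the evaluated policy."""
    if random.random() < 0.8:                   # survives the stage
        return (i + 1 if i < 2 else TERMINAL, 1.0)
    return TERMINAL, 20.0                       # failure terminates the trajectory

J = [0.0] * 4                                   # J[TERMINAL] stays 0
visits = [0] * 4
for _ in range(5000):                           # many simulated trajectories
    i = 0
    while i != TERMINAL:
        j, cost = simulate(i)
        visits[i] += 1
        gamma = 1.0 / visits[i]                 # step size 1/m, as in the text
        J[i] += gamma * (cost + J[j] - J[i])    # TD(0) update
        i = j

print("estimated cost-to-go:", [round(v, 2) for v in J[:3]])
# exact values for comparison: J(2) = 4.8, J(1) = 8.64, J(0) = 11.71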

7.2.2 Q-learning

Q-learning is similar to a value iteration method based on simulation. The method estimates the Q-factors directly, without the multiple policy evaluations of the TD method. The optimal Q-factors are defined by:

Q*(i, u) = Σ_{j ∈ Ω_X} P(j, u, i) · [C(j, u, i) + J*(j)]      (7.1)

The optimality equation can be rewritten in terms of Q-factors:

J*(i) = min_{u ∈ Ω_U(i)} Q*(i, u)      (7.2)

By combining the two equations, we obtain:

Q*(i, u) = Σ_{j ∈ Ω_X} P(j, u, i) · [C(j, u, i) + min_{v ∈ Ω_U(j)} Q*(j, v)]      (7.3)

Q*(i, u) is the unique solution of this equation. The Q-learning algorithm is based on (7.3). Q(i, u) can be initialized arbitrarily. For each sample (X_k, X_{k+1}, U_k, C_k), do:

U_k = argmin_{u ∈ Ω_U(X_k)} Q(X_k, u)

Q(X_k, U_k) := (1 − γ) · Q(X_k, U_k) + γ · [C(X_{k+1}, U_k, X_k) + min_{u ∈ Ω_U(X_{k+1})} Q(X_{k+1}, u)]

with γ defined as for TD.

The trade-off exploration/exploitation
Convergence of the algorithm to the optimal solution requires that all pairs (i, u) are tried infinitely often, which is not realistic. In practice, a trade-off must be made between phases of exploitation, during which a base policy (also called greedy policy) is evaluated (similar to the idea of TD(0)), and phases of exploration, during which new controls are tried and a new greedy policy is determined.
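The sketch below illustrates Q-learning with an ε-greedy exploration/exploitation trade-off on a hypothetical component-ageing environment (all numbers are assumptions). It deviates slightly from the formulation above in that it uses a discounted criterion and a constant step size, which is common in practice.

# A minimal sketch of Q-learning with epsilon-greedy exploration on a hypothetical
# environment: states 0..2 are component ages, actions are 0 = continue, 1 = replace.
import random

def step(i, u):
    """Sample (next_state, cost); the environment is only sampled, never modelled."""
    if u == 1:
        return 0, 5.0                            # preventive replacement cost
    fail = [0.05, 0.2, 0.5][i]                   # assumed failure probability per age
    if random.random() < fail:
        return 0, 25.0                           # failure: corrective maintenance
    return min(i + 1, 2), 0.0                    # survive the stage, age by one

step_size, eps, discount = 0.1, 0.1, 0.95
Q = [[0.0, 0.0] for _ in range(3)]
i = 0
for _ in range(200_000):
    # exploration / exploitation trade-off
    u = random.randrange(2) if random.random() < eps else min((0, 1), key=lambda a: Q[i][a])
    j, cost = step(i, u)
    target = cost + discount * min(Q[j])
    Q[i][u] += step_size * (target - Q[i][u])    # Q-learning update
    i = j

policy = [min((0, 1), key=lambda a: Q[s][a]) for s in range(3)]
print("greedy policy (0 = continue, 1 = replace):", policy)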

7.3 Indirect Learning

On-line applications can take advantage of the experience gained from real-time use by:
• using the direct learning approach presented in the previous section on each "sample" of experience;
• building a model of the transition probabilities and cost function on-line, and then using this model for off-line training of the system through simulation with direct learning.

7.4 Supervised Learning

With the methods presented in the previous sections, the cost-to-go or Q-functions were represented in tabular form. These approaches are suitable for moderate-size problems. However, for large state and control spaces, this would be too computationally intensive. To overcome this problem, approximation methods can be used to approximate the cost-to-go or Q-functions over the whole state and control space.

As an example, consider a cost-to-go function J_µ(i). It is replaced by a suitable approximation J(i, r), where r is a vector that has to be optimized based on the available samples of J_µ. In the tabular representation investigated previously, J_µ(i) was stored for every value of i; with an approximation structure, only the vector r is stored. Function approximators must be able to generalize well over the state space the information gained from the samples. In other words, they should minimize the error between the true function and the approximated one, J_µ(i) − J(i, r). There are many possible methods for function approximation; this field is related to supervised learning. Possible methods are, for example, artificial neural networks, kernel-based methods, tree-based methods and Bayesian statistics.

A general approach to a supervised learning problem can be:
• Determine an adequate structure for the approximated function and a corresponding supervised learning method.
• Determine the input features of the function, that is, the important inputs that characterize the state of the system. The features are generally based on experience or insight about the problem.
• Decide on a training algorithm.
• Gather a training set.
• Train the function with the training set. The function can then be validated using a subset of the training set.
• Evaluate the performance of the approximated function using a test set.

An important difference between classical supervised learning and the learning performed in reinforcement learning is that no true training set exists. The training sets are obtained either by simulation or from real-time samples, which is already an approximation of the real function.
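As a small illustration of such an approximation structure, the sketch below fits a linear model J(i, r) on two hand-picked features of the state (its age and the age squared) to noisy samples of the cost-to-go, as could be obtained from simulated trajectories. The data and features are hypothetical; only the parameter vector r is stored.

# A minimal sketch of approximating a cost-to-go function with a parametric
# structure J(i, r): a linear model on simple features of the state (assumed data).
import numpy as np

def features(age):
    return np.array([1.0, age, age ** 2])       # input features chosen from insight

# hypothetical samples (age, observed cost-to-go from that state)
ages = np.random.uniform(0, 10, size=200)
targets = 2.0 + 0.5 * ages ** 2 + np.random.normal(0, 3.0, size=200)

Phi = np.array([features(a) for a in ages])
r, *_ = np.linalg.lstsq(Phi, targets, rcond=None)   # train: least-squares fit of r

def J_approx(age, r=r):
    return features(age) @ r                    # only the vector r is stored

print("r =", np.round(r, 3), " J(7) ~", round(J_approx(7.0), 2))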


Chapter 8

Review of Models for Maintenance Optimization
This chapter reviews several SDP maintenance models found in the literature. In conclusion, the approaches and methods are compared and their applicability to maintenance problems in power systems is discussed.

8.1 Finite Horizon Dynamic Programming

8.1.1 Deterministic Models

Dekker et al. [46] propose a rolling horizon approach for short-term scheduling and grouping of maintenance activities. Each individual maintenance activity is first planned based on an infinite horizon optimization. The short-term planning uses these maintenance activities as inputs. Penalties are defined for deviations from the original time of maintenance for each activity. The whole set of maintenance activities is then optimized using finite horizon dynamic programming.

8.1.2 Stochastic Models

In [37], an SDP model is proposed to solve a finite horizon maintenance scheduling problem for generating units. The system considered is composed of n generating units. The possible states for each unit are the number of remaining stages of maintenance, and the possible failure of a unit not in maintenance during the stage. The failure rates are assumed constant, but different before and after maintenance. Unserved energy and unserved reserve costs are considered in the cost function. One interesting feature of the model is that the time to complete maintenance is considered stochastic. Another is that the maintenance crew is assumed limited, so maintenance can be done on only one generating unit at a time. The model is illustrated with a 3-unit example with 4, 5 and 6 possible states for the different units. A 52-week horizon is considered, with stages of one week length.

8.2 Infinite Horizon Stochastic Models

8.2.1 Discrete Time Infinite Horizon Models

In [14], an infinite horizon SDP model is considered for optimizing the maintenance of a single-component system. The system can be in different deterioration states, maintenance states or a failure state. Two kinds of failures are considered, random failure and deterioration failure, each modeled by a failure state with a different time to repair. The time to deterioration failure is represented by an Erlang distribution. The preventive maintenance is considered imperfect. If the system fails, the component is replaced. An average cost-to-go approach is used to evaluate the policy. First, a Markov process of the system is investigated to determine the optimal mean time to preventive maintenance. A Markov decision process model is then built using the state probabilities and the calculated optimal mean time to preventive maintenance. The MDP is solved using the policy iteration algorithm; the model is proved to be unichain before applying the algorithm. An illustrative example is given, with 3 deterioration states, one preventive maintenance state for each deterioration state, and one failure state.

Jayakumar et al. [21] propose a similar MDP. Major and minor maintenance are possible. For each possible maintenance action, the deterioration level after the maintenance is stochastic, which is more realistic. The model is solved using the linear programming method.

8.2.2 Semi-Markov Decision Process

Many condition-based maintenance models based on SMDPs have been proposed in recent years. Amari et al. [3] present a general framework for solving condition-based maintenance problems using SMDPs. The interest of the model is that, for each possible deterioration state, the possible maintenance decisions are minor maintenance and major maintenance (replacement), but also the choice of the next inspection time. A hypothetical example is given; the model consists of 5 deterioration states and 1 failure state, and 20 possible values for the inspection time are considered.

The model of [14] is extended to an SMDP in [42]. The inspection time is calculated prior to the optimization using a semi-Markov process. The SMDP model is said to be superior because it includes the state sojourn time. The model is illustrated with an example based on a 230 kV air blast circuit breaker.

8.3 Reinforcement Learning

Kalles et al. [24] propose the use of RL for preventive maintenance of power plants. The article aims at motivating the use of RL for monitoring and maintenance of power plants. The main advantage put forward is the automatic learning capability of RL. The problem of time-lag (the time between an action and its effect) is pointed out. Penalties are defined by deviations from normal operation of the system. The proposed approach should first be used in parallel with the existing expert systems, so that the RL algorithm learns the environment; it could then be applied in practice. One important condition for good learning of the environment is that the algorithm is trained in all situations, and especially in critical situations.

8.4 Conclusions

An important assumption of all the models is the loss of memory (Markovian models). The assumption is related to the principle of optimality: the transition probabilities of the models can depend only on the current state of the system, independently of its history.

The finite horizon approach is adapted to short-term optimization. From the literature review, this approach can be applied to maintenance scheduling. I believe that the approach is interesting because it can integrate opportunistic maintenance; Chapter 9 gives an example of this type of model. A limitation is the consequence of the curse of dimensionality: the complexity of the model increases exponentially with the number of states. In consequence, the number of components of a finite horizon SDP model cannot be too high if the model is to remain tractable.

Several Markov Decision Process and Semi-Markov Decision Process models have been proposed for solving condition-based maintenance problems. The models consider an average cost-to-go, which is realistic. SMDPs have the advantage of being able to optimize the time to the next inspection depending on the state; they are also more complex. The models found in the literature consider only single components with only one state variable. MDPs could be very useful for scheduled CBM and SMDPs for inspection-based CBM. However, for continuous-time monitoring, it would be recommended to use approximate methods.

Approximate dynamic programming (reinforcement learning) has many advantages. The methods do not need an explicit model of the system to exist; they learn from samples and could be used to adapt to a system. Moreover, they can handle large state spaces in comparison with MDPs. In my opinion, reinforcement learning could be used for continuous-time monitoring of systems with several monitored state variables. The article [24] also proposed this approach for condition monitoring of power plants; however, no implementation of the idea has been found in the literature. A practical disadvantage of this approach is that the learning process is time consuming. It can (and should) be done off-line, or based on a model that already exists but is too large to be solvable with classical methods. A technical difficulty is the choice of an adequate supervised learning structure.

Table 8.1 shows a summary of the models and the most important methods.
Table 8.1: Summary of models and methods

Finite Horizon Dynamic Programming
- Characteristics: the model can be non-stationary.
- Methods: value iteration.
- Possible application in maintenance optimization: short-term maintenance optimization / scheduling.
- Advantages / disadvantages: limited state space (number of components).

Markov Decision Processes
- Characteristics: stationary model; possible approaches: average cost-to-go, discounted, shortest path.
- Methods: classical methods for MDPs: value iteration (VI), policy iteration (PI), linear programming (LP).
- Possible application in maintenance optimization: condition monitoring maintenance optimization; short-term maintenance optimization.
- Advantages / disadvantages: VI can converge fast for a high discount factor; PI is faster in general; LP allows additional constraints but is limited to smaller state spaces than VI and PI.

Approximate Dynamic Programming for MDP
- Characteristics: same as MDP, for larger systems.
- Methods: TD-learning, Q-learning.
- Possible application in maintenance optimization: continuous-time condition monitoring maintenance optimization.
- Advantages / disadvantages: can work without an explicit model; can handle large state spaces compared with classical MDP methods.

Semi-Markov Decision Processes
- Characteristics: same as MDP.
- Methods: same as MDP.
- Possible application in maintenance optimization: optimization of inspection-based maintenance (average cost-to-go approach).
- Advantages / disadvantages: can optimize the inspection interval; more complex.

Chapter 9

A Proposed Finite Horizon Replacement Model
A finite horizon SDP replacement model is proposed in this chapter. The model assumes a finite time horizon and discrete decision epochs. The system under consideration is a power generating unit. An interesting feature of the model is the integration of the electricity price as a state variable. Another is the possibility of opportunistic maintenance, i.e. if one component fails, it is possible to do preventive maintenance on another component that is still working. The proposed model is first presented for one component and is then generalized to multiple components. Both models can be solved using the value iteration algorithm.

9.1 One-Component Model

9.1.1 Idea of the Model

In this chapter, an age replacement model based on finite horizon dynamic programming is proposed. The model is first described for one component, for an easier understanding of its principle.

The price of electricity is considered an important factor that could influence the maintenance decision. Indeed, if the electricity price is high, it can be profitable to operate the system and wait for lower prices. If a high electricity price is expected in the near future, it could be interesting to do maintenance immediately, to be operational later and avoid maintenance during a profitable period. This idea was included in the model: the electricity price is a state variable. The variable considers different electricity scenarios, for example high, medium and low prices. For each scenario, the electricity price varies with a period of one year. There can be transitions from one scenario to another, depending on the period of the year.

In the Scandinavian countries, a large part of the electricity is based on hydropower. The electricity price is consequently highly influenced by the weather. If the weather is warm and dry, the hydro storage will be low and the electricity price for the rest of the year may be high. On the contrary, a cold and rainy season may result in low electricity prices for the rest of the year. This observation could be used to assume that the electricity scenario is transient during the summer and stable during the rest of the year, typically interpreted as a dry year or a wet year. This assumption could be used as a basis for modelling the transitions of the electricity state.

9.1.2 Notations for the Proposed Model

Numbers
NE        Number of electricity scenarios
N^W       Number of working states for the component
N^PM      Number of preventive maintenance states for one component
N^CM      Number of corrective maintenance states for one component

Costs
CE(s, k)  Electricity cost at stage k for the electricity state s
C^I       Cost per stage for interruption
C^PM      Cost per stage of preventive maintenance
C^CM      Cost per stage of corrective maintenance
C^N(i)    Terminal cost if the component is in state i

Variables
i1        Component state at the current stage
i2        Electricity state at the current stage
j1        Possible component state for the next stage
j2        Possible electricity state for the next stage

State and Control Space
x1_k      Component state at stage k
x2_k      Electricity state at stage k

Probability functions
λ(t)      Failure rate of the component at age t
λ(i)      Failure rate of the component in state Wi

Sets
Ωx1       Component state space
Ωx2       Electricity state space
Ω_U(i)    Decision space for state i

State notations
W         Working state
PM        Preventive maintenance state
CM        Corrective maintenance state

9.1.3 Assumptions

• The time span of the problem is T. It is divided into N stages of length Ts, such that T = N · Ts. The maintenance decisions are made sequentially at each stage k = 0, 1, ..., N−1.
• The failure rate of the component over time is assumed perfectly known. This function is denoted λ(t).
• If the component fails during stage k, corrective maintenance is undertaken for N^CM stages with a cost of C^CM per stage.
• It is possible at each stage to decide to replace the component to prevent corrective maintenance. The time of preventive replacement is N^PM stages with a cost of C^PM per stage.
• If the system is not working, a cost for interruption C^I per stage is considered.
• The average production of the generating unit is G kW. This means that if the unit is not in preventive maintenance or failure, G · Ts kWh are produced during the stage (Ts in hours).
• NE possible electricity price scenarios are considered. The prices are assumed fixed during a stage (equal to the price at the beginning of the stage for the given scenario). For scenario s, the electricity price per kWh is noted CE(s, k), k = 0, 1, ..., N−1. It is possible that the electricity price "switches" from one scenario to another during the time span. The probability of transition at each stage is assumed known.
• A terminal cost (for stage N) can be used to penalize the terminal stage condition.
• The manpower is assumed unlimited. Spare parts are not considered.

9.1.4 Model Description

9.1.4.1 State Space

The state vector Xk is composed of two state variables: x1_k for the state of the component (its age) and x2_k for the electricity scenario, so NX = 2. The state of the system is thus represented by a vector as in (9.1):

Xk = (x1_k, x2_k),   x1_k ∈ Ωx1, x2_k ∈ Ωx2      (9.1)

Ωx1 is the set of possible states for the component and Ωx2 the set of possible electricity scenarios.

Component state
The status of the component (its age) at each stage is represented by the state variable x1_k. There are three types of possible states for this variable: normal states (W), when the component is working; corrective maintenance (CM) states, if the component is in maintenance due to a failure; and preventive maintenance (PM) states. The meaning of a state is that the component has been in the corresponding condition during the last stage. For example, if the component is in a state PM, it means that during the last stage it has undergone preventive maintenance. The numbers of CM and PM states for the component correspond respectively to N^CM and N^PM.

To limit the size of the state space, it is necessary to limit the number of W states. It can be assumed that when λ(t) reaches a fixed limit λmax = λ(Tmax), preventive maintenance is always made. Another possibility is to assume that λ(t) stays constant once age Tmax is reached; in this case, Tmax can for example correspond to the time when λ(t) > 50 % for t > Tmax. This latter approach was implemented. In both cases, the corresponding number of W states is N^W = Tmax/Ts, rounded to the closest integer.


[Figure 9.1: Example of Markov Decision Process for one component with N^CM = 3, N^PM = 2, N^W = 4. Solid lines: u = 0 (ageing with probability 1 − Ts·λ(q), failure with probability Ts·λ(q)); dashed line: u = 1 (preventive maintenance).]

Figure 9.1 shows an example of a graphical representation of the MDP model for one component. In this example, x1_k ∈ Ωx1 = {W0, ..., W4, PM1, CM1, CM2}. The state W0 is used to represent a new component; PM2 and CM3 are both represented by this state. More generally, Ωx1 = {W0, ..., W_{N^W}, PM1, ..., PM_{N^PM−1}, CM1, ..., CM_{N^CM−1}}.

Electricity scenario state
Electricity scenarios are associated with one state variable x2_k. There are NE possible states for this variable, each state corresponding to one possible electricity scenario: x2_k ∈ Ωx2 = {S1, ..., S_{NE}}. The electricity price of scenario S at stage k is given by the electricity price function CE(S, k). Figure 9.2 shows an example with three possible scenarios, corresponding to high, medium and low electricity prices (respectively dry, normal and wet year). The weather during the season influences the water reserves in a country such as Sweden. Hydropower is a large part of the electricity generation in Sweden, and it is moreover a cheap source of energy. Consequently, if the water reserves are low, more expensive sources of energy are needed and the electricity price is higher.

[Figure 9.2: Example of electricity scenarios, NE = 3. The figure shows the electricity price (SEK/MWh) of the three scenarios as a function of the stage, with possible transitions between scenarios.]

9.1.4.2 Decision Space

At each stage, if the component is not in maintenance, the decision maker can decide whether or not to do preventive maintenance, depending on the state X of the system:

Uk = 0   no preventive maintenance
Uk = 1   preventive maintenance

The decision space depends only on the component state i1:

Ω_U(i) = {0, 1} if i1 ∈ {W1, ..., W_{N^W}},   Ω_U(i) = ∅ otherwise

9.1.4.3 Transition Probabilities

The two state variables are independent. Moreover, only the electricity state transitions depend on the stage. Consequently,

P(X_{k+1} = j | U_k = u, X_k = i)
= P(x1_{k+1} = j1, x2_{k+1} = j2 | u_k = u, x1_k = i1, x2_k = i2)
= P(x1_{k+1} = j1 | u_k = u, x1_k = i1) · P(x2_{k+1} = j2 | x2_k = i2)
= P(j1, u, i1) · P_k(j2, i2)

Component state transition probabilities
At each stage k, if the state of the component is Wq, the failure rate is assumed constant during the stage and equal to λ(Wq) = λ(q · Ts). The transition probabilities for the component state are stationary. They can be represented as a Markov decision process, as in the example in Figure 9.1. Table 9.1 summarizes the transition probabilities that are not equal to zero. Note that if N^PM = 1 or N^CM = 1, then PM1 respectively CM1 corresponds to W0.

Electricity state
The transition probabilities of the electricity state, P_k(j2, i2), are not stationary; they can change from stage to stage. Tables 9.2 and 9.3 give an example of transition probabilities for the electricity scenarios on a 12-stage horizon. In this example, P_k(j2, i2) can take three different values, defined by the transition matrices PE^1, PE^2 or PE^3. i2 is represented by the rows of the matrices and j2 by the columns.

Table 9.1: Transition probabilities

i1                              u    j1         P(j1, u, i1)
Wq, q ∈ {0, ..., N^W − 1}       0    Wq+1       1 − λ(Wq)
Wq, q ∈ {0, ..., N^W − 1}       0    CM1        λ(Wq)
W_{N^W}                         0    W_{N^W}    1 − λ(W_{N^W})
W_{N^W}                         0    CM1        λ(W_{N^W})
Wq, q ∈ {0, ..., N^W}           1    PM1        1
PMq, q ∈ {1, ..., N^PM − 2}     ∅    PMq+1      1
PM_{N^PM − 1}                   ∅    W0         1
CMq, q ∈ {1, ..., N^CM − 2}     ∅    CMq+1      1
CM_{N^CM − 1}                   ∅    W0         1

Table 9.2: Example of transition matrices for the electricity scenarios

PE^1 = [1 0 0; 0 1 0; 0 0 1]
PE^2 = [1/3 1/3 1/3; 1/3 1/3 1/3; 1/3 1/3 1/3]
PE^3 = [0.6 0.2 0.2; 0.2 0.6 0.2; 0.2 0.2 0.6]

Table 9.3: Example of transition probabilities on a 12-stage horizon

Stage k:       0     1     2     3     4     5     6     7     8     9     10    11
P_k(j2, i2):   PE^1  PE^1  PE^1  PE^3  PE^3  PE^2  PE^2  PE^2  PE^3  PE^1  PE^1  PE^1

9.1.4.4 Cost Function

The costs associated with the possible transitions can be of different kinds:
• Reward for electricity generation, G · Ts · CE(i2, k) (depends on the electricity scenario state i2 and the stage k).
• Cost for maintenance, C^CM or C^PM.
• Cost for interruption, C^I.

Moreover, a terminal cost noted C^N could be used to penalize deviations from a required state at the end of the time horizon. This option and its consequences were not studied in this work. The transition costs are summarized in Table 9.4. Notice that i2 is a state variable. A possible terminal cost is defined by C^N(i) for each possible terminal state i of the component.

Table 9.4: Transition costs

i1                              u    j1         Ck(j, u, i)
Wq, q ∈ {0, ..., N^W − 1}       0    Wq+1       G · Ts · CE(i2, k)
Wq, q ∈ {0, ..., N^W − 1}       0    CM1        C^I + C^CM
W_{N^W}                         0    W_{N^W}    G · Ts · CE(i2, k)
W_{N^W}                         0    CM1        C^I + C^CM
Wq                              1    PM1        C^I + C^PM
PMq, q ∈ {1, ..., N^PM − 2}     ∅    PMq+1      C^I + C^PM
PM_{N^PM − 1}                   ∅    W0         C^I + C^PM
CMq, q ∈ {1, ..., N^CM − 2}     ∅    CMq+1      C^I + C^CM
CM_{N^CM − 1}                   ∅    W0         C^I + C^CM
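A numerical sketch of the one-component model, solved by backward induction (finite horizon value iteration), is given below. All numerical values (failure probabilities, costs, number of stages, electricity scenarios) are invented for the illustration and are not the thesis case study; the electricity scenarios use a simplified stationary transition matrix, and the generation reward of Table 9.4 is treated as a negative cost so that the total cost is minimized.

# A minimal sketch of the one-component finite horizon model (hypothetical data).
import numpy as np

Ts = 1.0                                   # stage length [h] (scaled arbitrarily)
N = 24                                     # number of stages
NW, NPM, NCM = 5, 2, 3                     # working / PM / CM states, as in Figure 9.1
G = 1000.0                                 # average production [kW]
C_I, C_PM, C_CM = 200.0, 500.0, 2000.0     # interruption / PM / CM costs per stage

def fail_prob(q):
    """Assumed per-stage failure probability Ts*lambda(W_q) for age state W_q."""
    return min(0.02 + 0.03 * q, 0.5)

# Component states: W0..W4, PM1, CM1, CM2 (PM2 and CM3 are merged with W0)
S = NW + (NPM - 1) + (NCM - 1)
PM1, CM1, CM2 = NW, NW + 1, NW + 2

# Electricity scenarios (price per kWh) with a simplified stationary transition matrix
CE = np.array([0.5, 0.3, 0.2])
PE = np.array([[0.8, 0.1, 0.1],
               [0.1, 0.8, 0.1],
               [0.1, 0.1, 0.8]])
NE = len(CE)

def transitions(i, u, e):
    """(next state, probability, stage cost) following Tables 9.1 and 9.4."""
    reward = -G * Ts * CE[e]               # generation reward as negative cost
    if i < NW:                             # working state W_i
        if u == 1:
            return [(PM1, 1.0, C_I + C_PM)]
        j_ok = min(i + 1, NW - 1)          # ageing, capped at the oldest state
        p_f = fail_prob(i)
        return [(j_ok, 1 - p_f, reward), (CM1, p_f, C_I + C_CM)]
    if i == PM1:
        return [(0, 1.0, C_I + C_PM)]      # PM finishes, back to a new component
    if i == CM1:
        return [(CM2, 1.0, C_I + C_CM)]
    return [(0, 1.0, C_I + C_CM)]          # CM2 -> W0

J = np.zeros((N + 1, S, NE))               # terminal cost C_N(i) = 0
policy = np.zeros((N, S, NE), dtype=int)
for k in range(N - 1, -1, -1):
    EJ = J[k + 1] @ PE.T                   # EJ[j, e] = E[J_{k+1}(j, e') | e]
    for e in range(NE):
        for i in range(S):
            controls = [0, 1] if 0 < i < NW else [0]
            vals = [sum(p * (c + EJ[j, e]) for j, p, c in transitions(i, u, e))
                    for u in controls]
            u_best = int(np.argmin(vals))
            J[k, i, e] = vals[u_best]
            policy[k, i, e] = controls[u_best]

print("Cost-to-go for a new component, scenario 0:", J[0, 0, 0])
print("Ages with preventive replacement at stage 0, scenario 0:",
      [q for q in range(NW) if policy[0, q, 0] == 1])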

9.2 Multi-Component Model

In this section, the model presented in Section 9.1 is extended to multi-component systems.

9.2.1 Idea of the Model

The motivation for a multi-component model is to consider possible opportunistic maintenance. It is sometimes possible to do maintenance on different parts of the system at opportunistic times. For example, if the system fails, it could be profitable to do maintenance on some components of the system that are still working but should be maintained soon. This could be very interesting if the interruption cost is high, or if the cost of the equipment needed for the maintenance is very high. In wind power, for example, a helicopter or a boat can be necessary for certain maintenance actions. Their rental price can be very high, and it could be profitable to group the maintenance of different wind turbines at the same time.

9.2.2 Notations for the Proposed Model

Numbers
NC          Number of components
Nc^W        Number of working states for component c
Nc^PM       Number of preventive maintenance states for component c
Nc^CM       Number of corrective maintenance states for component c

Costs
Cc^PM       Cost per stage of preventive maintenance for component c
Cc^CM       Cost per stage of corrective maintenance for component c
Cc^N(i)     Terminal cost if component c is in state i

Variables
i^c, c ∈ {1, ..., NC}     State of component c at the current stage
i^{NC+1}                  Electricity state at the current stage
j^c, c ∈ {1, ..., NC}     State of component c for the next stage
j^{NC+1}                  Electricity state for the next stage
u^c, c ∈ {1, ..., NC}     Decision variable for component c

State and Control Space
x^c_k, c ∈ {1, ..., NC}   State of component c at stage k
x^c                       A component state
x^{NC+1}_k                Electricity state at stage k
u^c_k                     Maintenance decision for component c at stage k

Probability functions
λc(i)       Failure probability function for component c

Sets
Ωx^c        State space for component c
Ωx^{NC+1}   Electricity state space
Ωu^c(i^c)   Decision space for component c in state i^c

9.2.3 Assumptions

• The system is composed of NC components in series. If one component fails, the whole system fails.
• The failure rate of each component over time is assumed perfectly known. This function is noted λc(t) for component c ∈ {1, ..., NC}.
• If component c fails during stage k, corrective maintenance is undertaken for Nc^CM stages with a cost of Cc^CM per stage.
• It is possible at each stage to decide to replace a component to prevent corrective maintenance. The time of preventive replacement for component c is Nc^PM stages with a cost of Cc^PM per stage.
• An interruption cost C^I is considered, whatever maintenance is done on the system.
• The average production of the generating unit is G kW. If none of the components of the unit is in preventive maintenance or failure, G · Ts kWh are produced during the stage (Ts in hours).
• A terminal cost Cc^N can be used to penalize the terminal stage condition of component c.

9.2.4 Model Description

9.2.4.1 State Space

The state of the system can be represented by a vector as in (9.2):

Xk = (x^1_k, ..., x^{NC}_k, x^{NC+1}_k)      (9.2)

x^c_k, c ∈ {1, ..., NC}, represents the state of component c, and x^{NC+1}_k represents the electricity state.

Component space
The numbers of CM and PM states for component c correspond respectively to Nc^CM and Nc^PM. The number of W states for each component c, Nc^W, is decided in the same way as for one component. The state space related to component c is noted Ωx^c:

x^c_k ∈ Ωx^c = {W0, ..., W_{Nc^W}, PM1, ..., PM_{Nc^PM−1}, CM1, ..., CM_{Nc^CM−1}}

Electricity space
Same as in Section 9.1.

9.2.4.2 Decision Space

At each stage, the decision maker must decide, for each component that is not in maintenance, whether to do preventive maintenance or nothing, depending on the state of the system:

u^c_k = 0   no preventive maintenance on component c
u^c_k = 1   preventive maintenance on component c

The decision variables constitute a decision vector:

Uk = (u^1_k, u^2_k, ..., u^{NC}_k)      (9.3)

The decision space for each decision variable can be defined by:

∀c ∈ {1, ..., NC},   Ωu^c(i^c) = {0, 1} if i^c ∈ {W0, ..., W_{Nc^W}},   Ωu^c(i^c) = ∅ otherwise

9.2.4.3 Transition Probability

The component state variables x^c are independent of the electricity state x^{NC+1}. Consequently,

P(X_{k+1} = j | U_k = U, X_k = i)
= P(x^1_{k+1} = j^1, ..., x^{NC+1}_{k+1} = j^{NC+1} | U_k = U, X_k = i)      (9.4)
= P((j^1, ..., j^{NC}), (u^1, ..., u^{NC}), (i^1, ..., i^{NC})) · P_k(j^{NC+1}, i^{NC+1})      (9.5)

The transition probabilities of the electricity state, P_k(j^{NC+1}, i^{NC+1}), are similar to those of the one-component model. They can be defined at each stage k by a transition matrix, as in the example of Section 9.1.

Component state transitions
The state variables x^c are not independent of each other. Indeed, if one component fails or is in maintenance, the other components are not ageing, since the system is not working. In consequence, different cases must be considered.

Case 1
If all the components are working and no maintenance is decided, the transition probability of the whole system is the product of the transition probabilities of each component considered independently.

If ∀c ∈ {1, ..., NC}, i^c ∈ {W1, ..., W_{Nc^W}} and U = 0:

P((j^1, ..., j^{NC}), 0, (i^1, ..., i^{NC})) = Π_{c=1}^{NC} P(j^c, 0, i^c)

Case 2
If at least one component is in maintenance, or preventive maintenance is decided for at least one component:

P((j^1, ..., j^{NC}), (u^1, ..., u^{NC}), (i^1, ..., i^{NC})) = Π_{c=1}^{NC} P^c

with

P^c = P(j^c, 1, i^c)   if u^c = 1 or i^c ∉ {W1, ..., W_{Nc^W}}
P^c = 1                if i^c ∈ {W0, ..., W_{Nc^W}}, u^c = 0 and j^c = i^c
P^c = 0                otherwise
9.2.4.4 Cost Function

As for the transition probabilities, there are two cases.

Case 1
If all the components are working, no maintenance is decided and no failure happens, a reward for the electricity produced is obtained.

If ∀c ∈ {1, ..., NC}, i^c ∈ {W1, ..., W_{Nc^W}}, j^c ∈ {W1, ..., W_{Nc^W}} and U = 0:

C((j^1, ..., j^{NC}), 0, (i^1, ..., i^{NC})) = G · Ts · CE(i^{NC+1}, k)

Case 2
When the system is in maintenance or fails during the stage, an interruption cost C^I is considered, as well as the sum of the costs of all the maintenance actions:

C((j^1, ..., j^{NC}), (u^1, ..., u^{NC}), (i^1, ..., i^{NC})) = C^I + Σ_{c=1}^{NC} C^c

with

C^c = Cc^PM   if i^c ∈ {PM1, ..., PM_{Nc^PM}} or j^c = PM1
C^c = Cc^CM   if i^c ∈ {CM1, ..., CM_{Nc^CM}} or j^c = CM1
C^c = 0       otherwise

9.3 Possible Extensions

The model could be extended in several directions. The following list summarizes some ideas on issues that could impact the model:

• Manpower. It would be interesting to limit the number of maintenance actions that can be carried out at the same time. A solution would be to consider a global decision space instead of an individual decision space for each component state variable.
• Other types of maintenance actions. In the model, replacement was the only possible maintenance action. In reality there are many possible maintenance actions, such as minor repair, major repair, etc. They could be modelled by adding possible maintenance decisions to the model.
• Non-deterministic time to repair. A stochastic repair time could be modelled by adding transition probabilities for the maintenance states.
• Use of deterioration states. If monitoring or inspection of some components is possible, deterioration state variables could be included in the model.
• Other forecasting states. It could be interesting to add other forecasting state information, such as weather and/or load states.


Chapter 10

Conclusions and Future Work
This thesis has reviewed models and methods based on Stochastic Dynamic Programming (SDP) and their application to maintenance problems. The theory of Dynamic Programming was introduced, with finite horizon and infinite horizon stochastic approaches, as well as Approximate Dynamic Programming (Reinforcement Learning) methods to solve infinite horizon SDP models.

A comparison of the methods available for infinite horizon SDP was made. Problems with a limited state space can be solved exactly. The Policy Iteration algorithm is empirically shown to converge the fastest; however, for a high discount factor, the Value Iteration algorithm can be better. Linear Programming can also be used if additional constraints need to be included in the model. Approximate Dynamic Programming methods are necessary for large state spaces.

A maintenance model based on finite horizon Stochastic Dynamic Programming was proposed to illustrate the theory. An interesting idea of the model was to enable opportunistic maintenance. Different ideas for state variables and possible extensions were also proposed.

A literature review of Dynamic Programming applications to maintenance optimization was made. Finite horizon deterministic and stochastic dynamic programming have mainly been applied to short-term maintenance scheduling. The idea of grouping maintenance activities on a finite horizon seems promising to avoid intractable models. Markov Decision Processes (MDP) and Semi-Markov Decision Processes (SMDP) are proposed in many articles to optimize maintenance decisions based on condition monitoring systems. The advantage of SMDPs is the ability to optimize the next time to maintenance depending on the actual state of the system. Only single state variable models have been found in the literature, for both MDPs and SMDPs. No application of Approximate Dynamic Programming (ADP) has been found in the literature, only a proposition of application.

The main limitation of Dynamic Programming is related to the curse of dimensionality: the time complexity increases exponentially with the number of state variables in the model. With the new advances in ADP methods, this limitation could be overcome. The ADP methods have mainly been applied to optimal control until now, but there are new opportunities for applying them to new fields such as maintenance optimization. The condition-based maintenance models proposed using MDPs or SMDPs may, for example, be generalized to multi-variable models where different parameters of a system are monitored.

In the power industry, maintenance contracts for a finite time are common. In this perspective, maintenance optimization should focus on finite horizon models. However, few finite horizon models are proposed in the literature. Two ways of using Dynamic Programming for finite horizon problems are possible: either directly with a finite horizon model, or with a discounted infinite horizon model, which is an approximation of a finite horizon model and must be stationary over time.

An idea could be to extend the finite horizon model proposed in this thesis. Markov Decision Processes and reinforcement learning could be applied to single-component monitoring (with possible monitoring of several parameters), while the finite horizon approach could use the results from the single-component models to optimize the maintenance of a complete system. The components in the finite horizon model could be simplified to a small number of possible deterioration/age states to limit the complexity of the model.


Appendix A

Solution of the Shortest Path Example
Solution of the shortest path problem with the value iteration algorithm.

Stage 4
J*_4(0) = φ(0) = 0

Stage 3
J*_3(0) = J*(H) = C(3, 0, 0) = 4,   u*_3(0) = u*(H) = 0
J*_3(1) = J*(I) = C(3, 1, 0) = 2,   u*_3(1) = u*(I) = 0
J*_3(2) = J*(J) = C(3, 2, 0) = 7,   u*_3(2) = u*(J) = 0

Stage 2
J*_2(0) = J*(E) = min{J*_3(0) + C(2, 0, 0), J*_3(1) + C(2, 0, 1)} = min{4 + 2, 2 + 5} = 6
u*_2(0) = u*(E) = argmin_{u∈{0,1}} {J*_3(0) + C(2, 0, 0), J*_3(1) + C(2, 0, 1)} = 0
J*_2(1) = J*(F) = min{J*_3(0) + C(2, 1, 0), J*_3(1) + C(2, 1, 1), J*_3(2) + C(2, 1, 2)} = min{4 + 7, 2 + 3, 7 + 2} = 5
u*_2(1) = u*(F) = argmin_{u∈{0,1,2}} {J*_3(0) + C(2, 1, 0), J*_3(1) + C(2, 1, 1), J*_3(2) + C(2, 1, 2)} = 1
J*_2(2) = J*(G) = min{J*_3(1) + C(2, 2, 1), J*_3(2) + C(2, 2, 2)} = min{2 + 1, 7 + 2} = 3
u*_2(2) = u*(G) = argmin_{u∈{1,2}} {J*_3(1) + C(2, 2, 1), J*_3(2) + C(2, 2, 2)} = 1

Stage 1
J*_1(0) = J*(B) = min{J*_2(0) + C(1, 0, 0), J*_2(1) + C(1, 0, 1)} = min{6 + 4, 5 + 6} = 10
u*_1(0) = u*(B) = argmin_{u∈{0,1}} {J*_2(0) + C(1, 0, 0), J*_2(1) + C(1, 0, 1)} = 0
J*_1(1) = J*(C) = min{J*_2(0) + C(1, 1, 0), J*_2(1) + C(1, 1, 1), J*_2(2) + C(1, 1, 2)} = min{6 + 2, 5 + 1, 3 + 3} = 6
u*_1(1) = u*(C) = argmin_{u∈{0,1,2}} {J*_2(0) + C(1, 1, 0), J*_2(1) + C(1, 1, 1), J*_2(2) + C(1, 1, 2)} = 1 or 2
J*_1(2) = J*(D) = min{J*_2(1) + C(1, 2, 1), J*_2(2) + C(1, 2, 2)} = min{5 + 5, 3 + 2} = 5
u*_1(2) = u*(D) = argmin_{u∈{1,2}} {J*_2(1) + C(1, 2, 1), J*_2(2) + C(1, 2, 2)} = 2

Stage 0
J*_0(0) = J*(A) = min{J*_1(0) + C(0, 0, 0), J*_1(1) + C(0, 0, 1), J*_1(2) + C(0, 0, 2)} = min{10 + 2, 6 + 4, 5 + 3} = 8
u*_0(0) = u*(A) = argmin_{u∈{0,1,2}} {J*_1(0) + C(0, 0, 0), J*_1(1) + C(0, 0, 1), J*_1(2) + C(0, 0, 2)} = 2
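The backward induction above can be checked with a few lines of code; the sketch below reproduces the computation using the arc costs C(k, i, j) listed in the calculations, and should print the optimal cost J*(A) = 8 together with the corresponding route.

# Verification of the shortest path example by backward induction.
C = {
    (0, 0, 0): 2, (0, 0, 1): 4, (0, 0, 2): 3,
    (1, 0, 0): 4, (1, 0, 1): 6,
    (1, 1, 0): 2, (1, 1, 1): 1, (1, 1, 2): 3,
    (1, 2, 1): 5, (1, 2, 2): 2,
    (2, 0, 0): 2, (2, 0, 1): 5,
    (2, 1, 0): 7, (2, 1, 1): 3, (2, 1, 2): 2,
    (2, 2, 1): 1, (2, 2, 2): 2,
    (3, 0, 0): 4, (3, 1, 0): 2, (3, 2, 0): 7,
}
N = 4
J = {(N, 0): 0.0}                                  # terminal cost phi(0) = 0
policy = {}
for k in range(N - 1, -1, -1):
    states = {i for (kk, i, _) in C if kk == k}
    for i in states:
        options = {j: C[(k, i, j)] + J[(k + 1, j)]
                   for (kk, ii, j) in C if kk == k and ii == i}
        j_best = min(options, key=options.get)
        J[(k, i)] = options[j_best]
        policy[(k, i)] = j_best

route = [0]
for k in range(N):
    route.append(policy[(k, route[-1])])
print("J*(A):", J[(0, 0)], " route (state index per stage):", route)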


Reference List
[1] Maintenance terminology. Svensk Standard SS-EN 13306 SIS, 2001. [2] Mohamed A-H. Inspection, maintenance and replacement models. Comput. Oper. Res., 22(4):435–441, 1995. [3] S.V. Amari and L.H. Pham. Cost-effective condition-based maintenance using markov decision processes. Reliability and Maintainability Symposium, 2006. RAMS’06. Annual, pages 464–469, 2006. [4] N. Andréasson. Optimisation of opportunistic replacement activities in deterministic and stochastic multi-component systems. Technical report, Chalmers, Göteborg University, 2004. Licentiate Thesis. [5] Y.W. Archibald and R. Dekker. Modified block-replacement for multiplecomponent systems. IEEE Transactions on Reliability, 45(1):75–83, 1996. [6] I. Bagai and K. Jain. Improvement, deterioration, and optimal replacement underage-replacement with minimal repair. IEEE Transactions on Reliability, 43(1):156–162, 1994. [7] R. E. Barlow and F. Proschan. Mathematical Theory of Reliability. Wiley, 1965. [8] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, 1957. [9] C. Berenguer, C. Chu, and A. Grall. Inspection and maintenance planning: an application of semi-Markov decision processes. Journal of Intelligent Manufacturing, 8(5):467–476, 1997. [10] M. Berg and B. Epstein. A modified block replacement policy. Naval Research Logistics Quarterly, 23:15–24, 1976. [11] M. Berg and B. Epstein. A note on a modified block replacement policy for units with increasing marginal running costs. Naval Research Logistics Quarterly, 26:157–179, 1979. 65

[12] L. Bertling, R. Allan, and R. Eriksson. A reliability-centered asset maintenance method for assessing the impact of maintenance in power distribution systems. IEEE Transactions on Power Systems, 20(1):75–82, 2005. [13] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996. [14] GK Chan and S. Asgarpoor. Optimum maintenance policy with Markov processes. Electric Power Systems Research, 76(6-7):452–456, 2006. [15] D.I. Cho and M. Parlar. A survey of maintenance models for multi-unit systems. European journal of operational research, 51(1):1–23, 1991. [16] R. Dekker, R.E. Wildeman, and F.A. van der Duyn Schouten. A review of multi-component maintenance models with economic dependence. Mathematical Methods of Operations Research (ZOR), 45(3):411–435, 1997. [17] B. Fox. Age Replacement with Discounting. Operations Research, 14(3):533– 537, 1966. [18] C. Fu, L. Ye, Y. Liu, R. Yu, B. Iung, Y. Cheng, and Y. Zeng. Predictive maintenance in intelligent-control-maintenance-management system for hydroelectric generating unit. IEEE Transactions on Energy Conversion, 19(1):179–186, 2004. [19] A. Haurie and P. L’Ecuyer. A stochastic control approach to group preventive replacement in a multicomponent system. IEEE Transactions on Automatic Control, 27(2):387–393, 1982. [20] P. Hilber and L. Bertling. Monetary importance of component reliability in electrical networks for maintenance optimization. In Probabilistic Methods Applied to Power Systems, 2004 International Conference on, pages 150–155, September 2004. [21] A. Jayakumar and S. Asgarpoor. Maintenance optimization of equipment by linear programming. In Probabilistic Methods Applied to Power Systems, 2004 International Conference on, pages 145–149, 2004. [22] Y. Jiang, Z. Zhong, J. McCalley, and TV Voorhis. Risk-based Maintenance Optimization for Transmission Equipment. Proc. of 12th Annual Substations Equipment Diagnostics Conference, 2004. [23] L. P. Kaelbling, M. L. Littman, and A. P. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996. [24] D. Kalles, A. Stathaki, and R.E. Kingm. Intelligent monitoring and maintenance of power plants. In Workshop on «Machine learning applications in the electric power industry», Chania, Greece., 1999. 66

[25] D. Kumar and U. Westberg. Maintenance scheduling under age replacement policy using proportional hazards model and TTT-plotting. European Journal of Operational Research, 99(3):507–515, 1997. [26] P. L’Ecuyer and A. Haurie. Preventive replacement for multicomponent systems: An opportunistic discrete time dynamic programming model. IEEE Transactions on Automatic Control, 32:117–118, 1983. [27] M. Lehtonen. On the optimal strategies of condition monitoring and maintenance allocation in distribution systems. In Probabilistic Methods Applied to Power Systems, 2006. PMAPS 2006. International Conference on, pages 1–5, 2006. [28] M.L. Littman. Algorithms for Sequential Decision Making. PhD thesis, Brown University, 1996. [29] Y. Mansour and S. Singh. On the complexity of policy iteration. Uncertainty in Artificial Intelligence, 99, 1999. [30] M.K.C. Marwali and S.M. Shahidehpour. Short-term transmission line maintenance scheduling in a deregulated system. Power Industry Computer Applications, 1999. PICA’99. Proceedings of the 21st 1999 IEEE International Conference, pages 31–37, 1999. [31] R.P. Nicolai and R. Dekker. Optimal maintenance of multi-component systems: a review. 2006. [32] J. Nilsson and L. Bertling. Maintenance management of wind power systems using condition monitoring systems-life cycle cost analysis for two case studies. IEEE Transaction on Energy Conversion, 22(1):223–229, 2007. [33] Julia Nilsson. Maintenance management of wind power systems - cost effect analysis of condition monitoring systems. Master’s thesis, Royal Institute of Technology (KTH), April 2006. [34] K.S. Park. Optimal wear-limit replacement with wear-dependent failures. IEEE Transactions on Reliability, 37(3):293–294, 1988. [35] K.S. Park. Condition-based predictive maintenance by multiple logisticfunction. IEEE Transactions on Reliability, 42(4):556–560, 1993. [36] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994. [37] A. Rajabi-Ghahnavie and M. Fotuhi-Firuzabad. Application of markov decision process in generating units maintenance scheduling. In Probabilistic Methods Applied to Power Systems, 2006. PMAPS 2006. International Conference on, pages 1–6, 2006. 67

[38] Rangan, Alagar, Ahyagarajan, Dimple, and Sarada. Optimal replacement of systems subject to shocks and random threshold failure. International Journal of Quality & Reliability Management, 23:1176–1191, 2006. [39] J. Ribrant and L. M. Bertling. Survey of failures in wind power systems with focus on swedish wind power plants during 1997-2005. IEEE Transaction on Energy Conversion, 22(1):167–173, 2007. [40] J. Si. Handbook of Learning and Approximate Dynamic Programming. WileyIEEE, 2004. [41] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. [42] C.L. Tomasevicz and S. Asgarpoor. Optimum maintenance policy using semimarkov decision processes. In Power Symposium, 2006. NAPS 2006. 38th North American, pages 23–28, 2006. [43] H. Wang. A survey of maintenance policies of deteriorating systems. European Journal of Operational Research, 139(3):469–489, 2002. [44] L. Wang, J. Chu, W. Mao, and Y. Fu. Advanced maintenance strategy for power plants - introducing intelligent maintenance system. In Intelligent Control and Automation, 2006. WCICA 2006. The Sixth World Congress on, volume 2, 2006. [45] R. Wildeman, R. Dekker, and A. Smit. A dynamic policy for grouping maintenance activities. European Journal of Operational Research. [46] R.E. Wildeman, R. Dekker, and A. Smit. A Dynamic Policy for Grouping Maintenance Activities. Econometric Institute, 1995. [47] Otto Wilhelmsson. Evaluation of the introduction of RCM for hydro power generators at vattenfall vattenkraft. Master’s thesis, Royal Institute of Technology (KTH), May 2005.

