Masters Thesis

Copyright by Shobha Vasudevan 2003

Automatic Verification of Arithmetic Circuits in RTL using Term Rewriting Systems

by Shobha Vasudevan, B.E.

THESIS Presented to the Faculty of the Graduate School of The University of Texas at Austin in Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE IN ENGINEERING

THE UNIVERSITY OF TEXAS AT AUSTIN December 2003

Automatic Verification of Arithmetic Circuits in RTL using Term Rewriting Systems

APPROVED BY SUPERVISING COMMITTEE:

Jacob A. Abraham, Supervisor Nur A. Touba Adnan Aziz

To Krishna, for being my quest. . . To Amma and Daddy, for showing me the way. . .

Acknowledgments

I’d like to thank my advisor, Dr. Jacob Abraham, for his invaluable support and guidance through the course of this work. His novel ideas, infectious enthusiasm and intellectually stimulating discussions kept me motivated and encouraged through the entire course of my graduate studies. Thank you, Sir, for your firm belief in me. It kept me going in the most trying times. I’d also like to thank my colleague and fellow PhD student, Vinod Viswanath, for his support and assistance through my Masters. His experience, insight, resourcefulness, skills and alacrity have been a priceless source of inspiration and help in obtaining this degree. Without his contribution, I don’t imagine I could have got this far. I’d like to thank Linda, Andrew, Shirley and Ruth for their promptness and efficiency in matters that required their attention. I’d also like to thank my lab-mates for their co-operation. I’d like to thank my friends Siddarth and Kunal, for bringing a lot of joy into my life in the U.S. Lastly, I’d like to thank my parents and sister for making me who I am.


Automatic Verification of Arithmetic Circuits in RTL using Term Rewriting Systems

Shobha Vasudevan, M.S.E.
The University of Texas at Austin, 2003

Supervisor: Jacob A. Abraham

We present a novel technique to formally verify arithmetic circuit designs at the Register Transfer Level (RT-Level). Our technique involves translation of circuits in Verilog RTL to Term Rewriting Systems (TRSs). We verify the target design using a simple, correct reference design with the same functionality. We translate the two designs into TRSs. We introduce a theory of equivalence of two TRSs. Using this theory, we prove the correctness of the target design with respect to the reference design. Our tool, Verifire, automates the entire technique. We demonstrate the applicability of this technique on adder designs. We illustrate the power of this technique when applied to multiplier verification. We show a detailed proof of correctness, as output by our tool, of a Booth multiplier and a Wallace Tree multiplier. We also show the extension of our tool to modifications of standard multipliers, with BISMUL, a modified Booth multiplier.


Table of Contents

Acknowledgments
Abstract
List of Tables
List of Figures
Chapter 1. Introduction
  1.1 Motivation and Prior Work
  1.2 Our Approach
  1.3 Organization Of Thesis
Chapter 2. An Overview of our Verification Methodology
  2.1 Choice of Representation
Chapter 3. Term Rewriting Systems
  3.0.1 Example Term Rewriting System
Chapter 4. Equivalence of Term Rewriting Systems
  4.1 Definition of theory of equivalence of TRSs
  4.2 Alternative Definition and Proof for Equivalence of TRSs
  4.3 Computing Comparison points
Chapter 5. Verifire : A fully automated proof generator
Chapter 6. Arithmetic Circuit Verification
  6.1 Adders
  6.2 Shifters and Comparators
Chapter 7. Multiplier Verification
  7.1 Booth Multiplier
  7.2 BISMUL
  7.3 Wallace Tree Multiplier
Chapter 8. Results and Discussion
  8.1 Results
  8.2 Discussion
Chapter 9. Conclusions
  9.1 Future Work
Appendices
Appendix A. An ACL2 Implementation of our Technique
  A.1 Project Description
  A.2 Verification of the 74181 ALU in ACL2
    A.2.1 Using ACL2
    A.2.2 The 74181 ALU
  A.3 Applying the technique to 16 bit adder operation of 74181 in ACL2
  A.4 Verification of a RISC pipeline using our technique
Appendix B. Verilog Code for the Shift-and-Add Multiplier
Appendix C. Verilog Code for the Booth Multiplier
Appendix D. Verilog Code for the BISMUL Multiplier
Bibliography
Vita

List of Tables

7.1 Partial product terms of the Booth multiplier.
7.2 The partial product terms in a BISMUL.
7.3 The inputs of each PPSEL.
8.1 Comparison of execution times of Verifire against two commercial equivalence checkers for a Booth multiplier of varying sizes. In each case the golden model was a shift and add multiplier of the corresponding size.
8.2 Comparison of execution times of Verifire against two commercial equivalence checkers for a Wallace Tree multiplier of varying sizes. In each case the golden model was a shift and add multiplier of the corresponding size.
8.3 Comparison of execution times of Verifire against one commercial equivalence checker assisted by manual comparison points. Results are shown for both Booth and Wallace Tree multipliers.

List of Figures

2.1 Relevant subset of allowed Verilog constructs. Verilog key words are in bold.
6.1 Proof of correctness of the Carry Lookahead adder compared against the Ripple Carry Adder. The terms of the RCA and the terms of the CLA are marked with distinct symbols. The figure shows the observable variable updated after each set of rewrites, and the expression equivalence between the observable terms of the two systems at each comparison point. The rewriting and the expression equivalence form the two engines of the Vprover.
7.1 Architecture of a Booth multiplier.
7.2 Proof of correctness of the Booth multiplier compared against the Shift&Add multiplier. The terms of the Shift&Add multiplier and the terms of the Booth multiplier are marked with distinct symbols. The rules of the Shift&Add multiplier at every stage (Rule x and Rule y) are shown alongside the corresponding Booth multiplier rules (Rule a ... Rule h). The observable variable updated after each set of rewrites is product. The figure also shows the expression equivalence between the observable terms of the two systems at each comparison point. The rewriting and the expression equivalence form the two engines of the Vprover.
7.3 Architecture of a BISMUL.
7.4 Architecture of a Wallace Tree Multiplier.

Chapter 1 Introduction

Verification of large designs at the Register Transfer Level (RT-Level) is a widely studied problem. Verification of arithmetic circuits, especially integer multipliers, presents an interesting nexus of challenge and opportunity. State-of-the-art verification techniques cannot verify these large circuits. In this work, we present a technique to formally verify large arithmetic circuit designs at the RT-level. We propose and describe the theory and technique that perform equivalence checking between two RT-level designs.

1.1 Motivation and Prior Work
Many attempts have been made to verify RT-level designs using Boolean-level techniques such as model checking, ATPG-based methods and SAT solving [3]–[6], [12], [31]. However, these techniques rely on brute-force Boolean algorithms and cannot effectively verify large RTL designs because of state space explosion. Verification of the functional correctness of arithmetic circuits, particularly multipliers [3]–[5], [14], has been a longstanding problem. Existing tools and techniques do not perform verification effectively between two RTL designs. State-of-the-art equivalence checkers are general-purpose tools that verify circuit designs efficiently only at the gate level. Though they can handle many generic circuits at this level, they cannot verify large multipliers. Also, the two designs being compared have to be very similar for reasonable performance. Completely disparate designs are not trivial to verify against each other.

There have been some efforts to verify large arithmetic circuits like multipliers using theorem provers and proof checkers [2], [8], [19], [26]. These are few in number, since multiplier verification using a theorem prover is a hard problem: considerable user expertise and ingenuity are required to prove the lemmas pertinent to the design [28]. Even if the multiplier design builds on designs (like adders and shifters) that exist in the rule database, the lemmas corresponding to the multiplier itself can be very intricate and numerous. If, however, the multiplier is built without any of the infrastructural lemmas that exist in the rule database, then the task of generating and proving lemmas is extremely complicated [20]. Additionally, theorem provers do not handle Verilog designs directly. Although there are translators that convert Verilog to ACL2, a large number of infrastructural lemmas for handling the translated RTL need to be written and proved [20].

Kapur et al. provide a methodology for specifying and verifying a family of parameterized multiplier circuits within the framework of the Rewrite Rule Laboratory (RRL) theorem prover [8], [18]. These circuits, however, are high-level functional abstractions and are not close to the implementation level. Also, the generality of these circuits is limited to their parametric size variations. The technique is not suitable for a generic design.

1.2 Our Approach
We propose a technique for formally verifying combinational arithmetic circuits, including multipliers, at the RT-level. We prove the equivalence of an implementation (or revised) Verilog RTL design against a specification (or golden) Verilog RTL design. We translate the golden and revised combinational Verilog RTL designs into Term Rewriting Systems (TRSs) [21]. We then prove input/output equivalence between the two TRSs. This notion of equivalence is compositional and thus also affords proof decomposition through stepwise refinement. In order to decompose this proof, we compute a set of comparison points.

Our tool, Verifire, is a dedicated arithmetic circuit checker that automates our technique (i.e., it needs minimal user guidance). It accepts the Verilog descriptions of the golden and revised designs and converts them automatically into the intermediate formal representation, the TRSs. It also automatically generates the comparison points for the stepwise refinement in the equivalence proof of the two TRSs. It uses an incomplete but efficient method to generate equivalence proofs at the comparison points.

Arithmetic circuits are often designed and optimized incrementally from a base design. This can afford the development of generically applicable decomposition strategies for equivalence proofs of the optimized design against the base design. We have found this to be the case for several families of integer arithmetic circuit designs and have leveraged this intuition in our tool to verify the optimized design against the base design. We have applied our technique successfully to the verification of optimized adders, comparators, shifters, and multipliers.

We split the space of multiplier designs into standard (base) designs and modified (optimized) designs. Standard designs are widely used multiplier designs like the Booth, Wallace Tree and Array multipliers. We prove the equivalence of optimizations of these designs against the standard designs, and prove the equivalence of the standard designs against a simple golden Shift-and-Add multiplier. We demonstrate the application of the technique to adders. In order to illustrate the actual power of our technique, we present the proof of correctness of non-trivial examples, a Booth multiplier and a Wallace Tree multiplier, verified against a Shift-and-Add multiplier. We also show the intuition for proving BISMUL [30], an optimized Booth multiplier, correct against the Booth multiplier. We also show a comparison of our approach with existing Boolean equivalence checkers.

All of the analysis performed by the tool is done on terms composed of RTL operators (e.g., bitwise-and, left-shift) as opposed to the Boolean netlist level, the level at which equivalence checkers operate. Terms are more concise and more efficient to manipulate than netlists. The potential downside is the incompleteness of the analysis, but we have found that with proper decomposition, the requisite analysis is reasonable to implement.

1.3 Organization Of Thesis
Chapter 2 explains our technique in detail, and gives a justification for selecting TRSs as our representation. Chapter 3 gives a description of TRSs and describes the features of the TRSs translated from Verilog HDL. In Chapter 4 we introduce a notion of equivalence between TRSs and provide an outline for our proof technique. We present our tool Verifire in Chapter 5. In Chapter 6, we show the application of our technique and the efficiency of our tool in the domain of (integer) arithmetic circuits. We also provide an illustrative proof of correctness for Carry Lookahead Adders in this chapter. We provide a detailed correctness proof for the Booth and Wallace Tree multipliers and outline the proof for BISMUL in Chapter 7. In Chapter 8 we provide the results of comparing our tool against commercial equivalence checkers for large multiplier designs. The chapter also discusses the merits and demerits of our technique in relation to other state-of-the-art techniques. Finally, conclusions and future work are presented in Chapter 9.

Chapter 2 An Overview of our Verification Methodology

We present an RT-level equivalence checking technique for large arithmetic circuits. A simple RTL design with the same functionality as the arithmetic circuit to be verified (the revised design) is the golden design. For instance, the ripple carry adder would be the golden design for verification of all target adder designs. Our proposed technique involves the translation of the golden and the revised RTL designs to an intermediate, well-known formalism, Term Rewriting Systems (TRSs) [21]. We treat Verilog as a programming language with deterministic semantics. We model Verilog program transformations by term rewriting. We have created a framework for this translation, which we provide here. We have developed the theory and methodology for checking and establishing the equivalence of TRSs. The equivalence checking of the two RTL designs is done by checking their corresponding TRSs for equivalence. The notion of equivalence used is input/output equivalence.

Our verification methodology is as follows. We translate the target design that has to be verified into a TRS. We translate a reference design that performs the same function as the target design, and is known to be correct, into its corresponding TRS. We prove that the two TRSs are equivalent by using our notion of observation equivalence. This technique can be used to verify very large multiplier designs, since it scales well with the size of the design.

2.1 Choice of Representation
We translate our designs in the Hardware Description Language, Verilog, into an intermediate formalism, Term Rewriting Systems, as mentioned above. We present a justification for this choice of representation, since we believe that it will illustrate the application and motivation behind the approach. Term Rewriting Systems are composed of terms and rules that rewrite these terms into each other. A formal definition follows in Chapter 3. The merits of using term-level modeling of Verilog designs are discussed later in Section 8.2. We would like to present the intuition for some of the ideas here.

Terms are a very succinct and intuitive way to model system behavior. They can encapsulate both data and control information, and both syntactic and semantic information, for the system they model, with the same interface. Also, they allow composition and information hiding to the desired extent. Chapter 3 discusses the widespread application of TRSs in many areas. Part of the reason for that is the flexibility of the term representations.

In our framework, the source and target logics (in this case, two Verilog program representations) are at the same level of abstraction, with a close semantic relationship and expressive power. The intermediary representation should be sufficiently expressive to serve as a target for translation, so that the translation itself is natural, i.e., results and updates on the data types in the source and target logics can be interpreted and expressed as corresponding results in the representation. In the past, TRSs have been used as an intermediary representation providing a theorem proving framework for hardware verification [27].

Our choice of TRS as an abstraction is also due to its expressive framework, which allows for convenient hierarchical representations. Since our method involves the translation of Verilog programs that have a hierarchical structure, the mapping between the two domains is very intuitive. Also, TRSs lend themselves to accurate and detailed modeling of the behavior of hardware systems, by allowing the modeling of concurrency and nondeterminism. The relation between Hardware Description Languages and TRSs is well established with the development of the TRAC [13]. Our technique involves the reverse mapping between these two behavioral description systems. While the hardware synthesized by the TRAC is used to build correct designs, our aim is to verify existing designs.

We have identified a subset of the Verilog HDL that we can translate into TRSs. This subset is synthesizable by commercial tools. We follow the IEEE draft standard for Synthesizable RTL Verilog/VHDL [11]. The circuit being translated needs to conform to this subset of Verilog. We provide a grammar for this synthesizable subset of Verilog in Figure 2.1. The framework for translating Verilog designs into TRSs is detailed in the next chapter.

module_definition ::= module module_name [parameter_list] declaration_list statement_list endmodule
declaration_list  ::= declaration | declaration_list declaration
declaration       ::= variable_type identifier_list [:= expression]
variable_type     ::= input | output | wire | reg
statement_list    ::= statement | statement_list statement
statement         ::= always_statement | if_statement | module_call_statement | variable_assignment_statement

Figure 2.1: Relevant subset of allowed Verilog constructs. Verilog key words are in bold.

Chapter 3 Term Rewriting Systems

We present a brief introduction to TRSs in this section. A Term Rewriting System can be represented by $(T, R)$, where $T$ is the set of terms and $R$ is the set of all rules in the system. A rewrite rule is shown as $(t_1 \to t_2)$ where $t_1, t_2 \in T$, with $\to$ pointing to the direction of the rewrite. A TRS is terminating if there are no infinite rewrite sequences $t_1 \to t_2 \to \cdots$. A TRS is confluent if any divergence in rewriting is eventually joined. A normal form is a term which cannot be rewritten any further. Termination ensures the existence of normal forms, while confluence ensures their uniqueness.

We translate circuit designs implemented in Verilog RTL [24] to Term Rewriting Systems. We have applied this framework to arithmetic circuits. Currently, we work in the domain of combinational circuits and do not consider sequential circuits. Therefore, we do not translate non-blocking assignments in Verilog, which can model non-deterministic semantics. In this domain of circuits, Verilog can be considered as having deterministic imperative programming language semantics. Some steps have been taken in this direction [16] for C-like languages. Some other steps have been taken for formalizing the semantics of Verilog [1], [29], [32].
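The definitions of rewrite rule and normal form given at the start of this chapter can be illustrated with a minimal Python sketch. It is not the representation used by Verifire: the tuple encoding of terms, the two example rules and all function names are assumptions made only for illustration.

```python
# Terms are nested tuples ("op", arg1, ...); leaves are strings or ints.
# A rule is a function returning the rewritten term, or None if it does not match.
# (Encoding and rule set are illustrative assumptions, not the thesis's own.)

def rewrite_once(term, rules):
    """Apply the first matching rule at the root or, failing that, in a subterm."""
    for rule in rules:
        result = rule(term)
        if result is not None:
            return result
    if isinstance(term, tuple):
        for i in range(1, len(term)):
            sub = rewrite_once(term[i], rules)
            if sub is not None:
                return term[:i] + (sub,) + term[i + 1:]
    return None  # no rule applies anywhere: term is a normal form


def normal_form(term, rules, max_steps=1000):
    """Rewrite until no rule applies; give up if the step bound is exceeded."""
    for _ in range(max_steps):
        nxt = rewrite_once(term, rules)
        if nxt is None:
            return term
        term = nxt
    raise RuntimeError("no normal form within max_steps (rules may not terminate)")


def and_idempotent(t):
    """(x & x) -> x"""
    if isinstance(t, tuple) and t[0] == "&" and t[1] == t[2]:
        return t[1]
    return None


def double_not(t):
    """~~x -> x"""
    if isinstance(t, tuple) and t[0] == "~" and isinstance(t[1], tuple) and t[1][0] == "~":
        return t[1][1]
    return None


RULES = [and_idempotent, double_not]
result = normal_form(("&", ("~", ("~", "a")), "a"), RULES)
```

Here the term `(~~a & a)` is first rewritten to `(a & a)` and then to `a`, after which no rule applies. The `max_steps` bound is only a guard for rule sets that are not terminating.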


We follow a set of rules for the translation of Verilog into Term Rewriting Systems. Every rewrite rule is a structure-preserving program transformation in the Verilog design. (Here we use the terms design and program interchangeably, since a design written in Verilog is viewed as a program.) Collectively, the set of all such rules can be viewed as the Term Rewriting System for the Verilog design. The left hand side of a rule is matched if the variables or the actual parameters are the same. If matched, these variables/function symbols are rewritten according to the corresponding update for that variable in the Verilog program. Therefore, all assignments and module calls in Verilog are modeled as rewrite rules.

The variables of the Term Rewriting System are all the variables declared in the Verilog design (i.e., inputs, outputs, wires, registers). Every module in Verilog is modeled by a function symbol. The parameters of the function are all the variables of the module as well as other module instantiations. Module instantiations are denoted by function symbols. So a module is a term, with variables and other module (function) calls as subterms.

A term is rewritten until no more rewriting is applicable to it. It is then said to be a normal form. Since our TRSs model Verilog designs, we allow variables to have a specific bit width and to be bitwise addressable. Our rewriting is directed toward obtaining the symbolic value of all the outputs of the Verilog design. We define the normal form of the TRS with respect to these output variables. If all the output bits (specified in the bit width) have been rewritten into (i.e., they appear on the right hand side of a rewrite rule), the term corresponding to the output variables is said to have reached a normal form. Therefore, our rewriting strategy is oriented toward obtaining an expression (or symbolic value) for all the bits of all the output variables. In the case of arithmetic circuit designs, explicit directing of the rewriting is usually not necessary, because these circuits do not have multiple paths leading to the outputs.

In well-behaved systems, the Verilog programs are deterministic and will produce a single value for each of the outputs. The corresponding TRS will have a unique normal form. The rewriting for such systems is terminating. The intuition for the proof is that the lexicographic path ordering for the TRS is the decreasing number of unknown bits of a variable in every successive rewrite. The TRS terminates since, for every rule $l \to r$ and substitution $\sigma$, the condition $l\sigma > r\sigma$ holds. The termination function itself is a monotonic homomorphism [10].

Although the rewrites are directed toward obtaining values for the output variables, if there is more than one output variable as a subterm of a term, there may be more than one rule that is applicable to that term. In other words, two output variables may concurrently get rewritten into, thereby forming a critical pair. The well-behaved arithmetic circuits we are looking at are designed to produce a unique result as an output. Therefore, the critical pairs will derive the same term (i.e., they are joinable), as mentioned in the termination proof argument. This term may be the final output (the unique normal form) or an intermediate point of join. Since all the critical pairs of the rewrite system can be shown to be joinable, the TRS is locally confluent and, because it is terminating, globally confluent by Newman's lemma. Since we prove that our systems are terminating and confluent, we prove that the TRSs translated from Verilog HDL are convergent.
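The termination and joinability arguments above can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not a proof and not part of Verifire: the half-adder-like state with two outputs, the rule table and the function names are all hypothetical. The measure is the number of outputs that still lack a symbolic value, and the two rules are applied in both possible orders to check that they reach the same normal form.

```python
from itertools import permutations

# State: output name -> symbolic value (None = not yet rewritten into).
# Each hypothetical rule fills in exactly one output from the inputs.
RULES = {
    "sum": lambda s: ("^", s["a"], s["b"]),
    "carry": lambda s: ("&", s["a"], s["b"]),
}


def unknown_count(state):
    """Termination measure: number of outputs still lacking a symbolic value."""
    return sum(1 for name in RULES if state[name] is None)


def apply_rule(state, name):
    """Rewrite one output; the input state is left untouched."""
    new_state = dict(state)
    new_state[name] = RULES[name](state)
    return new_state


def normal_forms(initial):
    """Apply the rules in every possible order and collect the normal forms."""
    results = []
    for order in permutations(RULES):
        state = initial
        for name in order:
            before = unknown_count(state)
            state = apply_rule(state, name)
            assert unknown_count(state) < before  # measure strictly decreases
        results.append(state)
    return results


start = {"a": "a", "b": "b", "sum": None, "carry": None}
forms = normal_forms(start)
joinable = all(f == forms[0] for f in forms)
```

Because every rewrite strictly decreases a non-negative integer measure, no infinite rewrite sequence exists for this toy system, and `joinable` records that both rewrite orders end in the same state.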


3.0.1 Example Term Rewriting System
We illustrate an example that codifies a Verilog program into a TRS.

module addmux(inA, inB, opt, sel, out);
  input inA, inB, opt, sel;
  output out;
  reg out;
  wire addout;
  adder add1 (addout, inA, inB);
  // procedural assignments to a reg must sit inside an always block
  always @(*) begin
    if (sel) out = addout;
    else out = opt;
  end
endmodule

module adder (S, A, B);
  input A, B;
  output S;
  assign S = A ^ B;  // one-bit sum without carry
endmodule

The terms in the TRS for the addmux module are:
addmux (inA, inB, opt, sel, out, addout, add1(addout, inA, inB))
add1 (S, A, B)

The rules in the TRS are:
addmux (inA, inB, opt, sel==1, out, addout, add1) ---> addmux (inA, inB, opt, sel, addout, addout, add1)
addmux (inA, inB, opt, sel==0, out, addout, add1) ---> addmux (inA, inB, opt, sel, opt, addout, add1)
adder (S, A, B) ---> adder (A ^ B, A, B)
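The three rules listed for the addmux example can be mimicked in a few lines of Python. This is a sketch only: the dictionary state, the function names and the way the adder result is copied back into addout are assumptions chosen for illustration, not the translation that Verifire performs.

```python
def adder_rule(A, B):
    """adder(S, A, B) ---> adder(A ^ B, A, B): returns the symbolic value of S."""
    return ("^", A, B)


def rewrite_addmux(inA, inB, opt, sel):
    """Rewrite the addmux term until its output variable has a symbolic value."""
    state = {"inA": inA, "inB": inB, "opt": opt, "sel": sel,
             "out": None, "addout": None}
    # Module instance add1: the value rewritten into S is reflected in addout.
    state["addout"] = adder_rule(state["inA"], state["inB"])
    # The two addmux rules, guarded by the value of sel.
    if sel == 1:
        state["out"] = state["addout"]
    elif sel == 0:
        state["out"] = state["opt"]
    return state


out_sel1 = rewrite_addmux("inA", "inB", "opt", 1)["out"]
out_sel0 = rewrite_addmux("inA", "inB", "opt", 0)["out"]
```

With `sel == 1` the output is rewritten into the symbolic term for `inA ^ inB`, and with `sel == 0` it is rewritten into `opt`, mirroring the right hand sides of the two addmux rules.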

Each subterm obtains its actual parameters from the calling term by a rule. To faithfully model the Verilog semantics, two rules are used to model every subterm (module instantiation). These rules, which associate a module instance with the set of variables that form its actual parameters, are obtained by a topological sorting of the Verilog code. Since every module instance (subterm) communicates only with the instantiating module (calling term), the updates to the variables that have been passed to the subterm have to be reflected in the variables of the calling term. Therefore, every module instantiation is also associated with an updating rule.

We assume that the input Verilog is race-free (i.e., no multiple parallel assignments to the same signal) and loop-free (i.e., no cyclic dependencies between combinational always blocks). The resulting structural TRS will then be convergent, i.e., confluent (due to the race-free assumption) and terminating (due to the loop-free assumption). Note that for this structural TRS, the Verilog RTL operators are uninterpreted; the structural TRS is only used to construct terms defining the values of signals in terms of other signals.

Chapter 4 Equivalence of Term Rewriting Systems

Our goal is to prove the equivalence of an implementation and a specification design. We assume these designs are (or can be translated into) combinational Verilog RTL modules which define a mapping from their inputs to outputs. The equivalence of these mappings is the target of the analysis. This target is shared with the combinational equivalence checking tools which are now ubiquitous. The monolithic verification problem is intractable in general and (similar to equivalence checkers) we use signal* names in the two modules as guidance in decomposing this equivalence proof.

* We use the word signal to refer to Verilog variables in RTL modules, and reserve variable for variables in TRSs.

4.1 Definition of theory of equivalence of TRSs
We now need to define the notion of equivalence we wish to check. For a term $t$ we define $support(t)$ to be the set of signal functions and variables in $t$. For two terms $s$ and $t$ we define $s \equiv t$ as $support(s) = support(t)$ and, for all ground substitutions $\sigma$ of $support(s)$, $s[\sigma] = t[\sigma]$. Given modules $G$ and $R$ (the golden and revised modules, where $PI = PI(G) = PI(R)$ and $PO = PO(G) = PO(R)$) and a set of signal functions $C$, define $R \equiv_C G$ as $(\forall t \in C : \Phi_G(t, C) \equiv \Phi_R(t, C))$. Note that in order for this definition to be useful we assume that equivalent signals in the two modules $G$ and $R$ have the same signal name. In cases of multiple assignments to the same signal in either module, we may need to adjust the names assigned to the multiply-assigned signals in order to ensure correspondence (we will present our approach to dealing with this problem later in this section).

We wish to prove $R \equiv_{PO} G$. Attempting to prove this automatically and monolithically is prohibitively expensive in general, so we compute a set of comparison point signal functions $C$ and make use of the following property to transfer the result to the proof of equivalence for $PO$:

Theorem 1. $\forall G, R, C, O : (((R \equiv_C G) \wedge (O \subseteq C)) \Rightarrow (R \equiv_O G))$

Proof Outline. Take an arbitrary signal function $s \in O$. We first observe that the term $\Phi_R(s, O)$ is equal to the iterative expansion beginning with the term $\Phi_R(s, C)$ where, in each step, the signals $x \in C \setminus O$ are substituted by the corresponding terms $\Phi_R(x, C)$. A similar observation holds for $\Phi_G(s, O)$. Then for any ground substitution for $O$, one can prove by induction following the iterative expansions we observed that the desired equality holds between $\Phi_R(s, O)$ and $\Phi_G(s, O)$, using the assumption $R \equiv_C G$ to relieve the induction hypothesis and substituting equals for equals along the way.

Thus, we prove $R \equiv_{PO} G$ by proving $R \equiv_C G$ instead, where $PO \subseteq C$. The following additional (trivial) property is useful in composing proofs together:

Theorem 2. $\forall G, R, I : (((R \equiv_{PO} I) \wedge (I \equiv_{PO} G)) \Rightarrow (R \equiv_{PO} G))$
Our procedure for proving

consists of the following two steps, . We compute as

compute a set of comparison points follows. We first include all signals in

and then prove Ê

which have the same names in the golden

and revised designs (including the required equivalent names for inputs and outputs). Comparison points for signals that have the same base name but are multiplyassignedÝ in either design, are determined by the following heuristic. We analyze the set of bits which will be assigned a non-constant value in each assignment to the base signal. We rename the assigned signals to match-up the assignments which assign the same number of non-constant bits in the different designs. This is simply a heuristic which appears to work well for arithmetic circuits. The user can bypass this heuristic by having unique signals in every assignment and only introducing comparison points for signals with the same name. We now turn our attention to proving , we iterate through each signal

Ê

. In order to check

Ê

Ü

¾

, and compute

×

¨Ê ´Ü µ, compute

Ø ×

¨ ´Ü µ, and check if × Ø. We need to have some mechanism for checking
Ø for two given terms. This is achieved using a separate fixed set of rewrite

rules which codify various identities about the RTL operators. For example, one may introduce an absorption and association rule for &, as well as rules for reducing arithmetic and left-shifts: (x & x) ---> x
Ý have multiple assignments

18

((x & y) & z) ---> (x & (y & z)) (x << 3) ---> (+ (x << 2) (x << 1) (x << 1)) (- (x << 1) x) ---> x ((x << 1) << 1) --> (x << 2) We denote this term simplification with the function Ö

Ù
´Øµ which maps

a term Ø to a reduced term which is equal under all substitutions to Ø. We then deduce

×

Ø when Ö Ù
´×µ

Ö Ù
´Øµ. We will discuss this term reduction function in Ø.

a later section, but we note that the procedure is not complete in determining × Instead Ö

Ù
is designed to be efficient and sufficient for the domain of circuits to

be analyzed. The decomposition of the equivalence check using comparison points and incremental refinement lessens the requirements on efficiency and sufficiency for the function Ö

Ù
.
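A minimal Python sketch of such a term reduction function is given below. It is not the reduce function of Verifire: it covers only the two & rules and a shift-merging rule that generalizes the listed ((x << 1) << 1) rule to arbitrary constant shift amounts (an assumption), and the names `reduce_term` and `equivalent` are hypothetical.

```python
def reduce_term(t):
    """Rewrite bottom-up with a small fixed rule set until no rule applies."""
    if not isinstance(t, tuple):
        return t
    t = (t[0],) + tuple(reduce_term(a) for a in t[1:])
    op = t[0]
    if op == "&":
        x, y = t[1], t[2]
        if x == y:  # (x & x) ---> x
            return x
        if isinstance(x, tuple) and x[0] == "&":
            # ((a & b) & y) ---> (a & (b & y))
            return reduce_term(("&", x[1], ("&", x[2], y)))
    if op == "<<":
        x, n = t[1], t[2]
        if isinstance(x, tuple) and x[0] == "<<":
            # ((x << a) << b) ---> (x << (a + b)), constant a and b assumed
            return reduce_term(("<<", x[1], x[2] + n))
    return t


def equivalent(s, t):
    """Sound but incomplete: True only when both terms reduce to the same term."""
    return reduce_term(s) == reduce_term(t)


shift_ok = equivalent(("<<", ("<<", "x", 1), 1), ("<<", "x", 2))
assoc_ok = equivalent(("&", ("&", "a", "b"), "c"), ("&", "a", ("&", "b", "c")))
commute_ok = equivalent(("&", "a", "b"), ("&", "b", "a"))
```

The last check returns False even though `a & b` and `b & a` are equal under all substitutions, because no commutativity rule is included. This illustrates the incompleteness noted above: a failed check means only that the rule set could not establish equivalence.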

We now demonstrate how the procedure works for the following simple golden and revised designs:

module G(input in, output reg out);
  always @* begin
    out = in << 1;   // out1
    out = out << 1;  // out2
    out = out << 1;  // out3
    out = out << 1;  // out4
  end
endmodule

module R(input in, output reg out);
  always @* begin
    out = in << 2;   // out2
    out = out << 2;  // out4
  end
endmodule

Since out is multiply assigned in both modules, our heuristic analyzes the set of non-constant bits in each assignment and deduces the comparison points defined by the names in the comments to the right of the assignments above. We then have the following set of comparison points: {out2, out4}. The comparison then proceeds by checking out2 and out4. For out4, we get the following terms: $s = \Phi_R(out4) = (out2() << 2)$ and $t = \Phi_G(out4) = ((out2() << 1) << 1)$. We get $reduce(s) = s$ and $reduce(t) = s$, and thus $s \equiv t$. The check for out2 proceeds in a similar fashion, with in() instead of out2().
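The out4 check can be mimicked with a few lines of Python. This is only a minimal sketch, not the reduction procedure used by the tool: the tuple encoding of terms, the names `reduce_term` and `equivalent`, and the generalisation of the shift rule to arbitrary shift amounts are all assumptions made for illustration.

```python
# Terms are nested tuples ("<<", term, k) or ("&", a, b); a string names a signal.
# The encoding and the function names are hypothetical, chosen for this sketch.

def reduce_term(t):
    """Rewrite a term bottom-up until none of the sample rules applies."""
    if isinstance(t, str):
        return t
    op, *args = t
    # Reduce sub-terms first; integer shift amounts are left untouched.
    args = [a if isinstance(a, int) else reduce_term(a) for a in args]
    if op == "<<":
        inner, k = args
        # ((x << i) << j) ---> (x << (i + j)); covers ((x << 1) << 1) ---> (x << 2)
        if isinstance(inner, tuple) and inner[0] == "<<":
            return reduce_term(("<<", inner[1], inner[2] + k))
        return ("<<", inner, k)
    if op == "&":
        a, b = args
        if a == b:                                   # (x & x) ---> x
            return a
        if isinstance(a, tuple) and a[0] == "&":     # ((x & y) & z) ---> (x & (y & z))
            return reduce_term(("&", a[1], ("&", a[2], b)))
        return ("&", a, b)
    return (op, *args)


def equivalent(s, t):
    """Sound but incomplete check: equal reduced forms imply equal terms."""
    return reduce_term(s) == reduce_term(t)


s = ("<<", "out2", 2)                 # revised design R: out4 = out2 << 2
t = ("<<", ("<<", "out2", 1), 1)      # golden design G: out4 = (out2 << 1) << 1
print(reduce_term(t), equivalent(s, t))
```

As in the text, a negative answer from `equivalent` would not prove that two terms differ; it would only mean that this rule set could not bring them to a common form.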

4.2 Alternative Definition and Proof for Equivalence of TRSs
Let $T_1$ and $T_2$ be two term rewriting systems with input alphabets $\Sigma_1$ and $\Sigma_2$. We define an observation function for each system. Let $O_1$ be a mapping from $\Sigma_1$ to $obs(\Sigma_1)$ and $O_2$ be a mapping from $\Sigma_2$ to $obs(\Sigma_2)$, where $obs(\Sigma_1)$ and $obs(\Sigma_2)$ represent the sets of all observable terms in the two TRSs. Assuming a mapping is provided between the terms of $T_1$ and $T_2$, every observable term $O_1 \in obs(\Sigma_1)$, together with the term $O_2 \in obs(\Sigma_2)$ that it maps to, forms an observable pair. The observable terms are typically the outputs in a Verilog design.

Since the variables need to be bit addressable, we define a function on a variable $x[n:0]$ that accepts the variable as its argument and returns the index of the last bit that has a value. We compare the observable terms of the two TRSs at specific comparison points. The $t$-th comparison point of two TRSs $T_1$ and $T_2$ corresponds to the rewrites in the respective TRSs such that, for every observable pair $(O_{t1}, O_{t2})$ defined in $T_1$ and $T_2$, this function returns the same index for $O_{t1}$ and $O_{t2}$. In other words, the observable terms (output variables) in the two systems are compared when the same number of bits of an output variable have been updated in both the systems. The symbolic expressions obtained by rewriting these bits of the variable in both the systems are compared against each other.

Theorem 3. Let $O_1[n:0]$ and $O_2[n:0]$ be an observable pair as defined above, each $n$ bits wide. Let $O_1(\bot)$ and $O_2(\bot)$ denote the initial (bottom) values of the terms. Let $O_1(\bot) \rightarrow^{\alpha} O_1[i]$ and $O_2(\bot) \rightarrow^{\beta} O_2[i]$ in $T_1$ and $T_2$ respectively, until the normal forms $N_1$ and $N_2$ are reached. Then every comparison point is after $\alpha$ rewrites in $T_1$ and $\beta$ rewrites in $T_2$. If $p_0, \ldots, p_k$ are all the comparison points, and $O_1[i] = O_2[i]$ at every $p_j$, the two TRSs $T_1$ and $T_2$ are observationally equivalent.

Proof: At $p_0$, $O_1[i]$ is equal to $O_2[i]$, that is, the initial (bottom) values of the two TRSs are equal. From the definition of comparison points, at $p_1$, $O_1$ and $O_2$ will have been rewritten to the same number of bits. Let this number of bits be $m$. We are given that at $p_1$, $O_1$ equals $O_2$. So the first $m$ bits are proved equal. At every successive comparison point, the next $m$ bits are proved equal, until the normal forms are reached. The normal form $N_1$ is the term in which the last bit of $O_1$ that has a value is bit $n$, and the normal form $N_2$ is the term in which the last bit of $O_2$ that has a value is bit $n$. $p_k$ corresponds to the comparison point that compares $N_1$ and $N_2$. Therefore, we can prove by induction over the comparison points that, by bitwise equivalence of $O_1$ and $O_2$, the two TRSs $T_1$ and $T_2$ are observationally equivalent.

4.3 Computing Comparison points
Since our TRSs model Verilog designs, we allow for variables to have a specific bit width and be bitwise addressable. Our rewriting is directed toward obtaining the symbolic values of the set of all the observed variables of the Verilog design. In the case of arithmetic circuits, this range typically corresponds to the outputs of the design. Therefore, our rewriting strategy is oriented toward obtaining an expression (or symbolic value) for all the bits of all the output variables. In the case of arithmetic circuit designs, explicit directing of the rewriting is usually not necessary. This is because these circuits do not have multiple paths leading to the outputs. In other words, the observable terms (output variables) in the two systems are compared when the same number of bits of an output variable have been updated in both the systems. At every rewriting step, the sets of observed variables (the ranges of the observation functions applied to the two designs) of the two TRSs are compared. If, as a result of the rewrite, the symbolic values of the same number of bits have been obtained, the rewrite step is considered a comparison point. The expressions for the two sets of observed variables are now checked for equivalence.

Consider, for example, a Carry Lookahead Adder (CLA) design being verified against a golden Ripple Carry Adder (RCA) design. When the observation function is applied to these designs, the range will be the output variables of the design. So, the Sum and Carry variables will be the observed variables. In the corresponding TRSs for these two systems, let us look at the number of rewriting steps taken by each system to rewrite a (symbolic) value for a bit in the Sum and Carry variables. In the RCA TRS, a single rewriting step produces the values for one bit of Sum and Carry. In the CLA TRS, a single rewriting step produces the values for four bits of the Sum and Carry variables. So, a comparison point is identified after four rewrite steps in the RCA and one rewrite step in the CLA. At every such comparison point that is identified, the symbolic expressions of the two TRSs are compared and checked for equivalence. The automatic computation of comparison points is an important contribution of our technique. The computation of comparison points is shown in greater detail in the proofs outlined in Chapter 6.
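The way comparison points fall out of the two update rates can be illustrated with a small simulation. The sketch below is an assumption-laden model rather than Vprover's algorithm: it supposes that each system produces a fixed number of output bits per rewrite step, and the function name `comparison_points` is hypothetical.

```python
def comparison_points(bits_per_step_a, bits_per_step_b, width):
    """List (steps_a, steps_b, bits_done) at every point where both systems
    have produced the same number of output bits."""
    points = []
    steps_a = steps_b = 0
    done_a = done_b = 0
    while done_a < width or done_b < width:
        # Advance the system that is behind (system a when they are level).
        if done_a <= done_b:
            steps_a += 1
            done_a = min(width, done_a + bits_per_step_a)
        else:
            steps_b += 1
            done_b = min(width, done_b + bits_per_step_b)
        if done_a == done_b:
            points.append((steps_a, steps_b, done_a))
    return points


# RCA (one bit per step) against CLA (four bits per step) on a 16-bit sum.
for steps_rca, steps_cla, bits in comparison_points(1, 4, 16):
    print(f"{bits:2d} bits: {steps_rca} RCA steps, {steps_cla} CLA steps")
```

Under these assumptions the first comparison point is reached after four RCA steps and one CLA step, and each later point adds four more RCA steps and one more CLA step.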


Chapter 5 Verifire : A fully automated proof generator

Verifire is a fully automated tool which implements the generic proof technique described in Chapter 4. There are two distinct parts to the tool, viz., a Verilog to Rewriting Systems translator (Vtrans) and a proof engine (Vprover).

Vtrans is a compiler which accepts synthesizable Verilog as input and translates it to a Term Rewriting System. The translator automatically identifies the module hierarchy and constructs a TRS for the entire design.

Vprover automatically generates proofs by using the notion of TRS equivalence between two Term Rewriting Systems. The reference TRS and the revised TRS are inputs to the proof engine, along with the observation function of each TRS. The mapping between the terms in the two TRSs is also predefined. With these inputs, the tool automatically generates a proof, or returns an error trace if it cannot establish the proof. Vprover is an iterative engine that checks, in every iteration, whether the condition of each rule is true. If a condition is satisfied, the corresponding term is rewritten. If the conditions for more than one rule are satisfied, then the rewrites occur concurrently.

Vprover computes the intermediary comparison points automatically. Using a set of directives, it generates proofs for the observed terms at every comparison point. The symbolic values of the observed terms are compared by comparing the expressions that have been generated by the rewrites/substitutions. In order to establish expression equivalence, the tool maintains a database of statically pre-verified expression minimizations. This set of minimizations is not complete. Additions need to be made whenever the rewriter cannot minimize an expression due to an insufficient database of minimization heuristics.

Verifire was implemented in C++ and was used to prove many multiplier circuits. The tool can automatically generate proofs for standard multiplier designs like the Booth multiplier, Array multipliers and Tree multipliers. It can also automatically generate proofs for multiplier designs that are modifications of these standard designs.


Chapter 6 Arithmetic Circuit Verification

In this section, we show the application of our technique and the efficiency of our tool in the domain of (integer) arithmetic circuits. Arithmetic circuits can be classified broadly into adders, shifters, comparators and multipliers. We illustrate the technique with respect to adders, and show how it works for modified shifters and comparators. We present all our experimental results with regard to multipliers. Due to the space constraint, we do not present the generated proofs for all the circuits mentioned, but present the adder proof as a representative example.

6.1 Adders
We illustrate our verification technique by verifying the functionality of a 16-bit Carry Lookahead Adder. We use a simple ripple carry adder as the golden design for adders. It adds two vectors by doing a bitwise xor and generates a corresponding carry. The Verilog code for a 16-bit ripple carry adder design is shown below.

module rca16bit(A, B, Cin, S, Cout);
  input [15:0] A, B;
  input Cin;
  output [15:0] S;
  output Cout;
  wire [15:0] Carry;
  rca1bit rca1bit0(A[0], B[0], Cin, S[0], Carry[0]);           // R1,R2
  rca1bit rca1bit1(A[1], B[1], Carry[0], S[1], Carry[1]);      // R3,R4
  . . .
  rca1bit rca1bit15(A[15], B[15], Carry[14], S[15], Cout);     // R31,R32
endmodule

module rca1bit(a, b, cin, s, cout);
  input a, b, cin;
  output s, cout;
  assign s = a ^ b ^ cin;                         // R33
  assign cout = a&b | b&cin | cin&a;              // R34
endmodule

The terms in the TRS for the above modules are

rca16bit(A, B, Cin, S, Cout, Carry,
    rca1bit0(A[0], B[0], Cin, S[0], Carry[0]),
    rca1bit1(A[1], B[1], Carry[0], S[1], Carry[1]),
    rca1bit2(A[2], B[2], Carry[1], S[2], Carry[2]),
    . . .
    rca1bit15(A[15], B[15], Carry[14], S[15], Cout))
rca1bit(a, b, cin, s, cout).

Each instance of the rca1bit() function is a subterm, with its own set of actual parameters. Every such subterm obtains its actual parameters from the calling term by a rule. These rules, which associate a module instance with the set of variables that form its actual parameters, are obtained by a topological sorting of the Verilog code. Since every module instance (subterm) communicates with only the instantiating module (calling term), the updates to the variables that have been passed to the subterm have to be reflected in the variables of the calling term. Therefore, every module instantiation is also associated with an updating rule. Since there are 16 module instantiations in the RCA Verilog code, there will be 32 associated rules. Rules 33 and 34 correspond to the transitions that are defined for the rca1bit() term. These rules can rewrite any particular instance of this term, since they have been defined for it.
Rule 33: rca1bit(a, b, c, s, cout) ---> rca1bit(a, b, c, a ^ b ^ c, cout)
Rule 34: rca1bit(a, b, c, s, cout) ---> rca1bit(a, b, c, s, a&b | b&c | c&a)
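To make the effect of Rules 33 and 34 concrete, the sketch below applies them to one rca1bit term per bit position and threads the carry from one instance to the next. It works on concrete 0/1 values instead of symbolic expressions, and the names `rule33`, `rule34` and `rca` are hypothetical; it is meant only to show the shape of the rewriting, not how Vtrans or Vprover represent terms.

```python
def rule33(term):
    """rca1bit(a, b, c, s, cout) ---> rca1bit(a, b, c, a ^ b ^ c, cout)"""
    a, b, c, _s, cout = term
    return (a, b, c, a ^ b ^ c, cout)


def rule34(term):
    """rca1bit(a, b, c, s, cout) ---> rca1bit(a, b, c, s, a&b | b&c | c&a)"""
    a, b, c, s, _cout = term
    return (a, b, c, s, (a & b) | (b & c) | (c & a))


def rca(A, B, cin, width=16):
    """Rewrite one rca1bit term per bit, passing each carry to the next instance."""
    carry, S = cin, 0
    for i in range(width):
        # Instantiation: bind the actual parameters of instance i.
        term = ((A >> i) & 1, (B >> i) & 1, carry, None, None)
        term = rule34(rule33(term))
        # Update: copy the results back into the calling term's variables.
        _, _, _, s, carry = term
        S |= s << i
    return S, carry


print(rca(1234, 4321, 0))
```

The instantiation and update comments correspond to the two rules that each module instance contributes (R1,R2 for the first instance, and so on).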

The above translation is done automatically by Vtrans, the translator in our tool. The target design, a Carry Lookahead Adder (CLA), is similarly translated from its Verilog implementation to a TRS. The Verilog code for the CLA is shown below.

module cla16bit (A, B, Cin, S, Cout);
  input [15:0] A, B;
  input Cin;
  output [15:0] S;
  output Cout;
  wire C3, C7, C11;
  fastcarry fc (A, B, Cin, C3, C7, C11);                   // R1,R2
  cla4bit cla0 (A[3:0], B[3:0], Cin, S[3:0]);              // R3,R4
  cla4bit cla1 (A[7:4], B[7:4], C3, S[7:4]);               // R5,R6
  cla4bit cla2 (A[11:8], B[11:8], C7, S[11:8]);            // R7,R8
  cla4bit cla3 (A[15:12], B[15:12], C11, S[15:12]);        // R9,R10
endmodule

module cla4bit (a, b, cin, s);
  input [3:0] a, b;
  input cin;
  output [3:0] s;
  wire [3:0] c, p, g;
  assign c[0] = g[0] | p[0]&cin;                           // R11
  assign c[1] = g[1] | g[0]&p[1] | p[1]&p[0]&cin;          // R12
  assign c[2] = g[2] | g[1]&p[2] | g[0]&p[2]&p[1]
                | p[2]&p[1]&p[0]&cin;                      // R13
  assign c[3] = g[3] | g[2]&p[3] | g[1]&p[3]&p[2]
                | g[0]&p[3]&p[2]&p[1]
                | p[3]&p[2]&p[1]&p[0]&cin;                 // R14
  assign s[0] = a[0] ^ b[0] ^ c[0];                        // R15
  assign s[1] = a[1] ^ b[1] ^ c[1];                        // R16
  assign s[2] = a[2] ^ b[2] ^ c[2];                        // R17
  assign s[3] = a[3] ^ b[3] ^ c[3];                        // R18
  PGgen pg0 (a[0], b[0], p[0], g[0]);                      // R19,R20
  PGgen pg1 (a[1], b[1], p[1], g[1]);                      // R21,R22
  PGgen pg2 (a[2], b[2], p[2], g[2]);                      // R23,R24
  PGgen pg3 (a[3], b[3], p[3], g[3]);                      // R25,R26
endmodule

module PGgen (a, b, p, g);
  input a, b;
  output p, g;
  assign p = a ^ b;                                        // R27
  assign g = a & b;                                        // R28
endmodule

The cla4bit module is called four times by the main module, so there are four cla4bit blocks in the design. The cla4bit module computes four successive carries at a time. The sum for the corresponding four bits is calculated in this module. The cla4bit module also calls the PGgen module, which generates the Ps (propagated carries) and the Gs (generated carries) for the block. A module called fastcarry is called to calculate the input carry values (Cin, C3, C7, C11) for each of the four cla4bit blocks. The terms in the CLA TRS are:

cla16bit(A, B, Cin, S, Cout,
    cla4bit0(A[3:0], B[3:0], Cin, S[3:0]),
    cla4bit1(A[7:4], B[7:4], C3, S[7:4]),
    cla4bit2(A[11:8], B[11:8], C7, S[11:8]),
    cla4bit3(A[15:12], B[15:12], C11, S[15:12]),
    fc())
cla4bit(a, b, cin, s, c,
    PGgen0(a[0], b[0], p[0], g[0]),
    PGgen1(a[1], b[1], p[1], g[1]),
    PGgen2(a[2], b[2], p[2], g[2]),
    PGgen3(a[3], b[3], p[3], g[3]))
PGgen(a, b, P, G)
fc(A, B, C, C', C'')

We have shown the rules that are generated from the Verilog by Vtrans as labels in the Verilog code, for the sake of clarity. As explained in the case of the RCA, every module call in Verilog generates two rules, one for instantiation and the other for updating. Rules 1 to 10 correspond to these rules in the cla16bit module. Rules 11 to 18 are defined for the cla4bit module. These compute the values of c[3:0] and use them to calculate s[3:0] as the R.H.S. Rules 19 to 26 are rules pertaining to module calls for the PGgen() module. We have not shown the rules pertaining to the fast carry block, fc, since the carries are calculated using the same type of rules as used in the cla4bit block.

We define an observation function for both the TRSs, whose range is the variables S and Cout. The comparison points for the two designs are computed as the transitions whose R.H.S. is a bit of the observed variables, S and Cout. (We assume a mapping between the two TRSs that gives the name correspondence of the variables of interest.) It must be noted that in the ripple carry adder TRS, every rewriting step (that includes the rules that instantiate and update the variable) updates only one bit of the sum, S. For instance, rules 1 and 2 in the RCA form a single rewrite step that updates S[0]. However, in the CLA, every rewriting step (Rules 1 to 10 in the TRS for CLA) updates four bits of the sum, S. Therefore, a single step in the CLA corresponds to four steps in the ripple carry adder. Comparison can take place only at the point where four bits of the ripple carry adder have been obtained. Vprover, the expression equivalence checker in our tool, uses these directives to compute the comparison points automatically.

At the first comparison point, after S[3] is obtained in the two TRSs, the expressions contained in S[3:0] are compared. Since the rewriting in both the systems is directed toward obtaining the observed variables, the trace of the rewriting steps that rewrite these variables is maintained. Vprover uses a set of minimization heuristics to compare expressions and establish equivalence. It is intuitive to understand how the expressions generated are equivalent by tracing the rewrite steps in both the TRSs that lead to the observed variable. A correspondence between the rules for the two TRSs is given below. Figure 6.1 explains this in an intuitive manner.

[Figure 6.1 diagram. RCA row: S[0] (R1,R2), S[1] (R3,R4), S[2] (R5,R6), S[3] (R7,R8), followed by S[7:4], S[11:8] and S[15:12]. CLA row: S[3:0] (R1,R2,R3,R4), followed by S[7:4], S[11:8] and S[15:12].]

Figure 6.1: Proof of correctness of the Carry Lookahead adder compared against the Ripple Carry Adder. Filled circles represent terms of the RCA. Open circles represent terms of the CLA. The variable shown at each step is the observable variable updated after a set of rewrites. The link between the two rows at each comparison point represents the expression equivalence between the observable terms of the two systems. The rewriting and the expression equivalence form the two engines of the Vprover.

Rule 33 in the RCA TRS corresponds to Rules 15 to 18 in the CLA TRS, since the four bits of the sum are computed as an xor of the corresponding input operand and carry bits. However, the input carry terms in the two TRSs are different. The expression for Carry[2] in the RCA is obtained by applying Rules 1 to 6. The corresponding value in the CLA TRS, c[3], is computed by Rules 11 to 14. Expanding the rewrites of Rules 19 to 26, we get the value of this fourth stage carry in terms of the Ps and Gs of the previous stages. Rules 27 and 28 give the same expression in terms of bits of A and B, instead of Ps and Gs. Therefore, the expressions of the carry in both the TRSs turn out to be exactly equal. Once S[3] is verified, a similar procedure is used to verify S[7], S[11] and S[15]. The normal form of both the TRSs is reached when S[15] is computed. The two TRSs are thereby proved equivalent.
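The claim that the lookahead carries and the rippled carries denote the same Boolean functions can be checked exhaustively for one 4-bit block. The sketch below is a brute-force truth-table comparison, not the symbolic minimization that Vprover performs; the function names are hypothetical, and the lookahead equations are written in the standard generate/propagate form with p = a ^ b and g = a & b.

```python
from itertools import product


def ripple_carries(a, b, cin):
    """Carry out of each bit position, computed one stage at a time."""
    carries, c = [], cin
    for ai, bi in zip(a, b):
        c = (ai & bi) | (bi & c) | (c & ai)
        carries.append(c)
    return carries


def lookahead_carries(a, b, cin):
    """The same carries, each written directly in terms of p, g and cin."""
    p = [ai ^ bi for ai, bi in zip(a, b)]
    g = [ai & bi for ai, bi in zip(a, b)]
    c0 = g[0] | p[0] & cin
    c1 = g[1] | g[0] & p[1] | p[1] & p[0] & cin
    c2 = g[2] | g[1] & p[2] | g[0] & p[2] & p[1] | p[2] & p[1] & p[0] & cin
    c3 = (g[3] | g[2] & p[3] | g[1] & p[3] & p[2] | g[0] & p[3] & p[2] & p[1]
          | p[3] & p[2] & p[1] & p[0] & cin)
    return [c0, c1, c2, c3]


def all_cases_agree():
    """Compare both formulations on all 2**9 assignments to a, b and cin."""
    for bits in product((0, 1), repeat=9):
        a, b, cin = bits[0:4], bits[4:8], bits[8]
        if ripple_carries(a, b, cin) != lookahead_carries(a, b, cin):
            return False
    return True


print(all_cases_agree())
```

Exhaustive enumeration is feasible here only because the block is four bits wide; the rewriting approach described in the text avoids enumerating input values at all.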

6.2 Shifters and Comparators
The technique can also be used to verify shifters and comparators. We have verified a modification of the shifter design. The modification involves the use of a mask that is dependent on the amount of shift. The result of a right rotator is ANDed with this mask to produce a shift right logical result. In this case, our golden design was the regular mux-based right shifter. Our tool managed to verify this design with no additional heuristics needed by Vprover to establish the expression equivalence. Although comparator designs are not particularly complex to verify, we have also verified disparate designs of the comparator, like the arithmetic comparator (using subtractors) and a comparator built from a tree of xor gates.
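The relation between the masked rotator and the plain logical right shifter can be sketched as follows. The 16-bit width, the mask construction and all function names are assumptions made for illustration; the thesis does not give the shifter's width or its mask logic.

```python
WIDTH = 16                      # assumed word width for this sketch
MASK_ALL = (1 << WIDTH) - 1


def rotate_right(x, k):
    """Rotate a WIDTH-bit word right by k positions."""
    k %= WIDTH
    return ((x >> k) | (x << (WIDTH - k))) & MASK_ALL


def shift_mask(k):
    """Ones in the low WIDTH - k positions, zeros in the k high positions
    that receive wrapped-around bits."""
    return MASK_ALL >> k


def masked_rotate(x, k):
    """Revised design: rotate, then AND with the shift-dependent mask."""
    return rotate_right(x, k) & shift_mask(k)


def logical_shift_right(x, k):
    """Golden design: ordinary logical right shift."""
    return (x & MASK_ALL) >> k


print(hex(masked_rotate(0xBEEF, 4)), hex(logical_shift_right(0xBEEF, 4)))
```

The mask clears exactly the bits that the rotation wraps around from the low end, which is why the two designs agree for every shift amount below the word width.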


Chapter 7 Multiplier Verification

We consider the space of multipliers divided into standard and non-standard multipliers. The standard multipliers are the widely used, common multiplier designs like Booth, Wallace tree, Dadda Tree and Array multipliers. The non-standard multipliers have incremental optimizations made to these standard multiplier designs. We have extended our technique to cover the space of these two categories of multipliers. We illustrate our technique on the Booth multiplier and BISMUL, an optimization of the Booth multiplier.

7.1 Booth Multiplier
The multiplier in Figure 7.1 is a 64-bit, radix-3, non-overlapping Booth multiplier. The ppgen block generates the eight partial products that form the Booth encoding. The ppsel block selects the relevant partial product depending on the incoming bits from the multiplier. The partial products are added in the adder block. The ppgen is given the shifted multiplicand as input (shift3) to generate the current partial product. This method is repeated for all the bits of the multiplier. The result appears in the product register. The entire Verilog code for the Booth multiplier is given in Appendix C.

[Figure 7.1 block diagram: Multiplier (mplier) and Multiplicand (mcand) inputs, shift3 blocks producing the Shifted Multiplier and the Shifted Multiplicand, a select signal, the Partial Product Generator (ppgen), the Adder (adder) and the Product Register (product).]

Figure 7.1: Architecture of a Booth multiplier.

To prove the functional correctness of the above design, we follow the technique explained in Chapter 2. We illustrate the proof using the outline of the proof provided in that section. We use a simple Shift-and-Add multiplier as the reference TRS for multipliers. It performs multiplication by generating partial products. It shifts the multiplicand left by one bit after every partial product calculation. The partial product of the current stage is set to the sum of the previous partial product and either the shifted multiplicand of the current stage or 0, depending on whether the multiplier bit corresponding to the current stage is 1 or 0. The Verilog code of the Shift-and-Add calls a shift and an add module iteratively. The entire Verilog code for the Shift-and-Add multiplier is given in Appendix B.

multiplier bits    partial product generated (x: multiplicand)
000                pp0: 0
001                pp1: x
010                pp2: x << 1
011                pp3: (x << 1) + x
100                pp4: x << 2
101                pp5: (x << 2) + x
110                pp6: (x << 2) + (x << 1)
111                pp7: (x << 3) - x

Table 7.1: Partial product terms of the Booth multiplier.

The target design here is the Booth multiplier discussed above. Vtrans extracts its TRS from the Verilog code. In the case of the Booth multiplier, the PO needed for the proof is product, as explained in Chapter 2. A sketch of this proof (as output by the tool) follows. The first comparison point in the proof is after 3 bits of output (product) are updated in both TRSs. This is because the Booth multiplier updates 3 bits of its product simultaneously, as opposed to the Shift-and-Add, which updates its product sequentially. The output of the tool after the first comparison point is as follows. Stage i in the tool output represents the i-th update of product.*

* The rules in the output are reproduced in pseudo-Verilog syntax and they correspond to the rewrite rules of the TRS as described in Chapter 2. For example, rules Reference.Stage 1.Rule x and Reference.Stage 1.Rule y in the TRS would be the rewrite rule product() ---> product() + if (y[0](), mcand(), 0).

Comparison Point 1:

Reference Model: Shift-and-Add
Stage 1.
  Rule x: product = product + mcand                   if (y[0])
  Rule y: product = product + 0                       if (!y[0])
Stage 2.
  Rule x: product = product + mcand<<1                if (y[1])
  Rule y: product = product + 0                       if (!y[1])
Stage 3.
  Rule x: product = product + mcand<<2                if (y[2])
  Rule y: product = product + 0                       if (!y[2])

Revised Model: Booth
Stage 1.
  Rule a: product = product + 0                       if (!y[0]&!y[1]&!y[2])
  Rule b: product = product + mcand                   if (y[0]&!y[1]&!y[2])
  Rule c: product = product + mcand<<1                if (!y[0]&y[1]&!y[2])
  Rule d: product = product + mcand<<1 + mcand        if (y[0]&y[1]&!y[2])
  Rule e: product = product + mcand<<2                if (!y[0]&!y[1]&y[2])
  Rule f: product = product + mcand<<2 + mcand        if (y[0]&!y[1]&y[2])
  Rule g: product = product + mcand<<2 + mcand<<1     if (!y[0]&y[1]&y[2])
  Rule h: product = product + mcand<<3 - mcand        if (y[0]&y[1]&y[2])

The expressions generated from both the TRSs from the first comparison point are displayed with their corresponding rules. For instance, Reference.Stage 1.Rule x is product = product + mcand if (y[0]). Correspondence at the comparison point is established by a case-by-case analysis of the rules. Every encoding of the Booth multiplier is compared to the Shift-and-Add, with the same conditions. For instance, Revised.Stage 1.Rule a gives the expression for the partial product generated for the Booth encoding 000. Applying the condition y[0]y[1]y[2] = 000 on the Shift-and-Add corresponds to rules Reference.Rule 1y, Reference.Rule 2y, and Reference.Rule 3y. Such an analysis is performed for all cases. The reduce() function is applied by Vprover to simplify corresponding terms at Comparison Point 1. An outline of this simplification as output by the tool is given below.

Correspondence (after case analysis):
Revised        Reference
Rule 1a  ==    Rule 1y, Rule 2y, Rule 3y
Rule 1b  ==    Rule 1x, Rule 2y, Rule 3y
Rule 1c  ==    Rule 1y, Rule 2x, Rule 3y
Rule 1d  ==    Rule 1x, Rule 2x, Rule 3y
Rule 1e  ==    Rule 1y, Rule 2y, Rule 3x
Rule 1f  ==    Rule 1x, Rule 2y, Rule 3x
Rule 1g  ==    Rule 1y, Rule 2x, Rule 3x
Rule 1h  ==    Rule 1x, Rule 2x, Rule 3x
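The case analysis at Comparison Point 1 can be replayed numerically. The sketch below evaluates the three Shift-and-Add stages and the single Booth rule selected by y[2:0] on concrete integers and checks that they agree for every encoding. It is a numeric sanity check under hypothetical names (`shift_and_add_3`, `BOOTH_TERMS`, `booth_stage`), not the symbolic proof that the tool produces.

```python
def shift_and_add_3(product, mcand, y):
    """Reference model: three stages, each adding mcand << i or 0 (Rules x/y)."""
    for i in range(3):
        product += (mcand << i) if (y >> i) & 1 else 0
    return product


# Revised model: one term per 3-bit encoding, indexed by y = y[2]y[1]y[0].
BOOTH_TERMS = [
    lambda m: 0,                     # Rule a: y = 000
    lambda m: m,                     # Rule b: y[0] only
    lambda m: m << 1,                # Rule c: y[1] only
    lambda m: (m << 1) + m,          # Rule d: y[0] and y[1]
    lambda m: m << 2,                # Rule e: y[2] only
    lambda m: (m << 2) + m,          # Rule f: y[0] and y[2]
    lambda m: (m << 2) + (m << 1),   # Rule g: y[1] and y[2]
    lambda m: (m << 3) - m,          # Rule h: y = 111
]


def booth_stage(product, mcand, y):
    """Revised model: a single step that adds the selected partial product."""
    return product + BOOTH_TERMS[y](mcand)


for y in range(8):
    assert booth_stage(0, 13, y) == shift_and_add_3(0, 13, y)
print("all eight encodings agree")
```

Rule h is the one case where the match is not syntactic: (m << 3) - m has to be recognised as m + (m << 1) + (m << 2), which is the kind of identity the minimization database is meant to supply.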

The proof now proceeds to subsequent comparison points iteratively until the output is obtained in its normal form. The required relation between the Booth TRS and the Shift-and-Add TRS is thereby proved. Figure 7.2 explains the proof in an intuitive manner.

7.2 BISMUL
In order to improve the performance of multipliers, more complicated algorithms and designs are used. We consider a high-performance multiplier, BISMUL [30], that is a modification of the Booth multiplier. A radix-3 Booth multiplier architecture using 3-bit scan with no overlap invariably generates dummy bits in the last 3-bit scan. In BISMUL, this last 3-bit scan is moved to the first 3-bit scan, so that the dummy bits can be used for odd multiple generation. The sequence for bit scanning is shown in Table 7.2. The improvement in the multiplication speed in BISMUL is obtained by executing several Partial Product Selectors (PPSELs) in parallel. The shifting sequence of the multiplier decides the inputs to the PPSELs. These selected partial products are summed through carry-save additions.

[Figure 7.2 diagram. Shift&Add row: P[0], P[1], P[2] (one R_s step each), then P[5:3], P[8:6], ..., P[63:61], P[64]. Booth row: P[2:0], P[5:3], P[8:6], ..., P[63:61] (one R_Booth step each), then P[64]. Compare Points 1, 2, 3, ..., 21, 22 link the two rows.]

Figure 7.2: Proof of correctness of the Booth multiplier compared against the Shift&Add multiplier. Filled circles represent terms of the Shift&Add multiplier. Open circles represent terms of the Booth multiplier. R_s represents the rules of the Shift&Add multiplier at every stage (Rule x and Rule y). R_Booth represents the corresponding Booth multiplier rules (Rule a ... Rule h). The variable shown at each step is the observable variable updated after a set of rewrites. Here it is product. The link between the two rows at each comparison point represents the expression equivalence between the observable terms of the two systems. The rewriting and the expression equivalence form the two engines of the Vprover.

[Figure 7.3 block diagram: Product/Shift Register (2n bits), Multiplier Register (n bits) feeding PPSEL 0 to PPSEL 3 through 16-bit inputs, Multiplicand Register (n bits), Partial Product Generator, Partial Product Registers (8m bits, with m = n + x and x = 0, 1, 2 or 3), four 8-to-1 MUXes and a Carry Save Adder (CSA).]

Figure 7.3: Architecture of a BISMUL.

The architecture of a 64 × 64 bit multiplier is shown in Figure 7.3. This architecture comprises Product Registers (PR), Partial Product Generators (PPG), Partial Product Selectors (PPS), Multiplexers and a Carry Save Adder (CSA). PPG is implemented according to Table 7.2. PPS consists of four PPSELs. Each PPSEL has 16-bit inputs, whose sequence is shown in Table 7.3. The operation of the BISMUL is as follows. In the first cycle, PPG generates eight partial products and each PPSEL selects one partial product. The two dummy bits in the lower bit position in the first three bits of PPSEL cause the selection of either 0 or four times the multiplicand (000 or 100). The partial products are added and stored in the PR. The partial products get generated in the first cycle. Subsequent cycles perform the same operation as described. The entire Verilog code for the BISMUL multiplier is given in Appendix D.

We prove the BISMUL correct by using the following technique. We perform a series of reductions on the BISMUL to reduce it to its standard design, the Booth multiplier. The standard Booth multiplier is already verified using the above technique. Hence, the given non-standard design can be proven correct.

bit    Generation of Partial Product Terms
000    P0: 0
001    P1: multiplicand
010    P2: shift multiplicand left by 1
011    P3: add P1 and P2
100    P4: shift multiplicand left by 2
101    P5: add P1 and P4
110    P6: shift P3 to the left by one
111    P7: subtract P1 from 8 times the multiplicand

Table 7.2: The partial product terms in a BISMUL

PPSEL     Inputs
PPSEL0    Multiplier[54:52],[42:40],[30:28],[18:16],[6:4][0]
PPSEL1    Multiplier[57:55],[45:43],[33:31],[21:19],[9:7][1]
PPSEL2    Multiplier[60:58],[48:46],[36:34],[24:22],[12:10][2]
PPSEL3    Multiplier[63:61],[51:49],[39:37],[27:25],[15:13][3]

Table 7.3: The inputs of each PPSEL

Verifire extracts the corresponding TRSs from the BISMUL and Booth Verilog code. The tool compares the modules in the non-standard design (that derive from the modules in the standard design) to the corresponding modules in the standard design. The correspondence (and equivalence) between these "derived" modules of the non-standard design and the modules in the standard design is established by the same method as described in Section 7.1 between the Booth and the Shift-and-Add. In order to prove the reduction of BISMUL to Booth, it is enough to prove that the changed terms (modules) in BISMUL are equivalent to the original Booth terms. In this case, the terms ppsel0, ppsel1, ppsel2, ppsel3, mux8to1 of the BISMUL form the revised design. The terms ppsel, shift3 of the Booth act as the corresponding reference design. Similarly, the terms productshift, carrysaveadder of BISMUL correspond to the ppsel, adder terms of Booth. Therefore, it is sufficient to prove the validity of each correspondence.

We have verified the BISMUL using our technique. We have also verified the Wallace Tree multiplier (Section 7.3). On similar lines, Array and Dadda Tree multipliers can also be verified using our technique. For each of these, we can also verify some modifications to the standard designs. The terms in the TRS for the modified design are simplified to terms in the TRS for the standard design. The simplification is performed using the database of rules in Vprover. This set of rules is not exhaustive and may require manual intervention when presented with an entirely new design that does not build on the standard ones. However, for a large space of designs, it is completely automated.

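[Figure 7.4 block diagram: the 4-bit Multiplier (y) and the 4-bit Multiplicand (mcand) feed the Partial Product Generator; its 8-bit outputs pass through a Carry Save Adder tree of two 3:2 CSAs, followed by a full adder that produces the 8-bit Product.]

Figure 7.4: Architecture of a 4-bit Wallace Tree Multiplier.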

7.3 Wallace Tree Multiplier
We have verified a 64 × 64 Wallace Tree multiplier using our technique. We show a verification outline in this section. In our design of the Wallace Tree, the partial products are generated without Booth encoding, so as to demonstrate the efficacy of the technique on disparate designs. This also means that the terms generated by the Wallace Tree multiplier TRS for the partial products are larger and more complicated than those of the radix-3 Booth encoded multiplier discussed in Section 7.1. In the interest of readability of this proof, we demonstrate an illustrative version that verifies a 4 × 4 Wallace Tree multiplier. This proof can be extrapolated along the same lines to prove the correctness of the 64 × 64 Wallace Tree multiplier.

Figure 7.4 shows a 4 × 4 Wallace Tree multiplier. The ppgen block generates four 8-bit partial products (one corresponding to each bit of the multiplier). The partial products are added in a 3:2 Carry Save Adder and a full adder.

The Shift-and-Add multiplier is used as the golden design in this proof. The working of the Shift-and-Add is described in Section 7.1. Vtrans translates the golden and the target designs into their corresponding TRSs. In the case of the Wallace Tree multiplier (as in the case of the Booth), the PO needed for the proof is product, as explained in Chapter 2. A sketch of this proof (as output by the tool) follows. In this proof the comparison points are not generated for intermediate comparison and rewriting of the terms. This is because the Wallace Tree design that we have chosen does not compute bits of the product partially. So, in the case of the 64 × 64 multiplier, the Wallace Tree TRS terms are rewritten into a large, composite term. The comparison point is after 64 steps of rewriting in the Shift-and-Add TRS and a single, monolithic rewrite step in the Wallace Tree TRS. For the current illustration of the proof on a 4 × 4 multiplier, the terms in the two TRSs are compared after 4 rewriting steps in the Shift-and-Add and one monolithic step of the Wallace Tree.
Comparison Point 1:

Reference Model: Shift-and-Add
  Stage 1. Rule x: product = product + mcand       if (y[0])
           Rule y: product = product + 0           if (!y[0])
  Stage 2. Rule x: product = product + mcand<<1    if (y[1])
           Rule y: product = product + 0           if (!y[1])
  Stage 3. Rule x: product = product + mcand<<2    if (y[2])
           Rule y: product = product + 0           if (!y[2])
  Stage 4. Rule x: product = product + mcand<<3    if (y[3])
           Rule y: product = product + 0           if (!y[3])

Revised Model: Wallace Tree
  Stage 1. Rule a: product = 0 + 0 + 0 + 0                             if (!y[0] & !y[1] & !y[2] & !y[3])
           Rule b: product = mcand + 0 + 0 + 0                         if ( y[0] & !y[1] & !y[2] & !y[3])
           Rule c: product = 0 + mcand<<1 + 0 + 0                      if (!y[0] &  y[1] & !y[2] & !y[3])
           Rule d: product = mcand + mcand<<1 + 0 + 0                  if ( y[0] &  y[1] & !y[2] & !y[3])
           Rule e: product = 0 + 0 + mcand<<2 + 0                      if (!y[0] & !y[1] &  y[2] & !y[3])
           Rule f: product = mcand + 0 + mcand<<2 + 0                  if ( y[0] & !y[1] &  y[2] & !y[3])
           Rule g: product = 0 + mcand<<1 + mcand<<2 + 0               if (!y[0] &  y[1] &  y[2] & !y[3])
           Rule h: product = mcand + mcand<<1 + mcand<<2 + 0           if ( y[0] &  y[1] &  y[2] & !y[3])
           Rule i: product = 0 + 0 + 0 + mcand<<3                      if (!y[0] & !y[1] & !y[2] &  y[3])
           Rule j: product = mcand + 0 + 0 + mcand<<3                  if ( y[0] & !y[1] & !y[2] &  y[3])
           Rule k: product = 0 + mcand<<1 + 0 + mcand<<3               if (!y[0] &  y[1] & !y[2] &  y[3])
           Rule l: product = mcand + mcand<<1 + 0 + mcand<<3           if ( y[0] &  y[1] & !y[2] &  y[3])
           Rule m: product = 0 + 0 + mcand<<2 + mcand<<3               if (!y[0] & !y[1] &  y[2] &  y[3])
           Rule n: product = mcand + 0 + mcand<<2 + mcand<<3           if ( y[0] & !y[1] &  y[2] &  y[3])
           Rule o: product = 0 + mcand<<1 + mcand<<2 + mcand<<3        if (!y[0] &  y[1] &  y[2] &  y[3])
           Rule p: product = mcand + mcand<<1 + mcand<<2 + mcand<<3    if ( y[0] &  y[1] &  y[2] &  y[3])


The expressions generated from both the TRSs at the first comparison point are displayed with their corresponding rules. For instance, Reference.Stage 1.Rule x is product = product + mcand if (y[0]). Correspondence at the comparison point is established by a case-by-case analysis of the rules. Every encoding of the Wallace Tree multiplier is compared to the Shift-and-Add, with the same conditions. For instance, Revised.Stage 1.Rule a gives the expression for the partial product generated for the multiplier value 0000. Applying the condition y[0]y[1]y[2]y[3] = 0000 on the Shift-and-Add corresponds to rules Reference.Rule 1y, Reference.Rule 2y, Reference.Rule 3y, and Reference.Rule 4y. Such an analysis is performed for all cases. The reduce() function is applied by Vprover to simplify corresponding terms at Comparison Point 1. An outline of this simplification, as output by the tool, is given below.
Correspondence (after case analysis):

  Revised        Reference
  Rule 1a  ==  Rule 1y, Rule 2y, Rule 3y, Rule 4y
  Rule 1b  ==  Rule 1x, Rule 2y, Rule 3y, Rule 4y
  Rule 1c  ==  Rule 1y, Rule 2x, Rule 3y, Rule 4y
  Rule 1d  ==  Rule 1x, Rule 2x, Rule 3y, Rule 4y
  Rule 1e  ==  Rule 1y, Rule 2y, Rule 3x, Rule 4y
  Rule 1f  ==  Rule 1x, Rule 2y, Rule 3x, Rule 4y
  Rule 1g  ==  Rule 1y, Rule 2x, Rule 3x, Rule 4y
  Rule 1h  ==  Rule 1x, Rule 2x, Rule 3x, Rule 4y
  Rule 1i  ==  Rule 1y, Rule 2y, Rule 3y, Rule 4x
  Rule 1j  ==  Rule 1x, Rule 2y, Rule 3y, Rule 4x
  Rule 1k  ==  Rule 1y, Rule 2x, Rule 3y, Rule 4x
  Rule 1l  ==  Rule 1x, Rule 2x, Rule 3y, Rule 4x
  Rule 1m  ==  Rule 1y, Rule 2y, Rule 3x, Rule 4x
  Rule 1n  ==  Rule 1x, Rule 2y, Rule 3x, Rule 4x
  Rule 1o  ==  Rule 1y, Rule 2x, Rule 3x, Rule 4x
  Rule 1p  ==  Rule 1x, Rule 2x, Rule 3x, Rule 4x
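The case analysis behind this correspondence can be mimicked outside the tool. The following Python fragment is only an illustrative sketch, not the Verifire implementation: the function names are hypothetical, and concrete integers stand in for the symbolic terms that Vprover manipulates. It steps a 4 × 4 Shift-and-Add model four times, applies one monolithic Wallace Tree rule, and records which reference rules fire for each revised rule.

```python
from itertools import product

WIDTH = 4  # the 4 x 4 illustration used in the proof sketch


def shift_and_add(mcand, y):
    """Reference model: one rewrite per multiplier bit (Rule x or Rule y)."""
    prod, fired = 0, []
    for i in range(WIDTH):
        if y[i]:
            prod += mcand << i          # Stage i+1, Rule x
            fired.append(f"{i + 1}x")
        else:
            prod += 0                   # Stage i+1, Rule y
            fired.append(f"{i + 1}y")
    return prod, fired


def wallace_rule(mcand, y):
    """Revised model: a single monolithic rewrite (one of Rules 1a-1p)."""
    index = sum(bit << i for i, bit in enumerate(y))
    name = "1" + "abcdefghijklmnop"[index]
    terms = [(mcand << i) if y[i] else 0 for i in range(WIDTH)]
    return sum(terms), name


def correspondence():
    """Map each revised rule to the reference rules fired under the same y."""
    table = {}
    for y in product([0, 1], repeat=WIDTH):
        for mcand in range(1 << WIDTH):
            ref, fired = shift_and_add(mcand, y)
            rev, name = wallace_rule(mcand, y)
            assert ref == rev  # both sides agree at the comparison point
        table[name] = fired
    return table


if __name__ == "__main__":
    for rule, fired in sorted(correspondence().items()):
        print("Rule", rule, "==", ", ".join("Rule " + f for f in fired))
```

Because this sketch enumerates concrete values, it only works for small widths; the tool instead compares symbolic expressions, which is what allows it to scale.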


In this case, the Wallace Tree TRS has reached its normal form at the end of the first comparison point. We have not shown the verification of the Carry Save Adders (CSAs) as a part of this proof. The CSAs are verified separately, and the symbol + that has been used in Rules 1a-p is assumed to be correct. A further advantage of this technique is that the composition of RT-level operators is possible. These operators can be left uninterpreted and verified separately.


Chapter 8 Results and Discussion

8.1 Results
We present the experimental results that we have obtained from our tool. We produce two sets of results, one on a radix 3 Booth multiplier and another on a Wallace Tree multiplier. We show the time taken by the tool for increasing sizes of these multipliers. We have tried to compare our tool to state-of-the-art equivalence checkers. Since the equivalence checkers are most efficient when comparing two gate level designs, we provided gate level implementations of the Booth and Wallace Tree designs as inputs. Although our tool works at the RT level, we have compared the numbers obtained from the gate level verification by the equivalence checkers with our tool output, in order to provide a basis for comparison. It is seen from Table 8.1 and Table 8.2 that the verification of the smaller multipliers is performed by both Commercial Equivalence Checker 1 and Commercial Equivalence Checker 2 in time comparable to our tool. However, in the case of 16 × 16 multipliers, neither equivalence checker runs to completion. Our tool, in comparison, verifies the 16 × 16 Booth design in 24 seconds. It can also be seen that as the sizes increase, the time taken by our tool scales linearly with the size of the design.
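The linear-scaling observation can be checked with a quick least-squares fit. The sketch below is illustrative only: the run times are the Verifire figures for the Booth multiplier in Table 8.1, while the operand widths (4 to 128 bits) are an assumed reading of the size column, and "size" is taken to mean operand width rather than gate count.

```python
# Verifire run times (seconds) for the Booth multiplier, from Table 8.1.
# ASSUMPTION: the rows correspond to operand widths 4, 8, 16, 32, 64, 128.
widths = [4, 8, 16, 32, 64, 128]
times = [16, 19, 24, 37, 53, 93]


def least_squares(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = least_squares(widths, times)
residuals = [t - (slope * w + intercept) for w, t in zip(widths, times)]

if __name__ == "__main__":
    print(f"time ~ {slope:.2f} s/bit * width + {intercept:.1f} s")
    print("largest deviation from the line:",
          f"{max(abs(r) for r in residuals):.1f} s")
```

Under these assumptions the fit gives roughly 0.6 seconds per additional bit of operand width on top of a fixed cost of about 15 seconds, with every measurement within 3 seconds of the line.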

  Booth Multiplier   Verifire   Commercial Tool 1   Commercial Tool 2
  4 × 4              16s        12s                 9s
  8 × 8              19s        20s                 16s
  16 × 16            24s        not completed       not completed
  32 × 32            37s        not completed       not completed
  64 × 64            53s        -                   -
  128 × 128          93s        -                   -

Table 8.1: Comparison of execution times of Verifire against two commercial equivalence checkers for a Booth multiplier of varying sizes. In each case the golden model was a shift and add multiplier of the corresponding size.

  Wallace Multiplier   Verifire   Commercial Tool 1   Commercial Tool 2
  4 × 4                14s        10s                 9s
  8 × 8                18s        18s                 16s
  16 × 16              25s        not completed       not completed
  32 × 32              40s        not completed       not completed
  64 × 64              60s        -                   -

Table 8.2: Comparison of execution times of Verifire against two commercial equivalence checkers for a Wallace Tree multiplier of varying sizes. In each case the golden model was a shift and add multiplier of the corresponding size.

In order to assist Commercial Equivalence Checker 1 in comparing RTL designs, we tried providing comparison points in the multiplier designs. These intermediate comparison points were the partial products obtained in the two multipliers. The results of this experiment are displayed in Table 8.3. The Commercial Equivalence Checker runs to completion when assisted manually with comparison points for the 16 × 16 case. However, it fails to run to completion, even when assisted by these comparison points, when comparing two 32 × 32 bit or higher order multipliers. It may be noted that we have provided manual assistance with respect to the comparison points to the equivalence checkers, as opposed to our tool, which generates these comparison points automatically.

  Multiplier   Verifire   Commercial Tool   Verifire    Commercial Tool
               (Booth)    (Booth)           (Wallace)   (Wallace)
  4 × 4        16s        12s               14s         10s
  8 × 8        19s        20s               18s         20s
  16 × 16      24s        1942s             25s         972s
  32 × 32      37s        not completed     40s         not completed
  64 × 64      53s        -                 60s         -

Table 8.3: Comparison of execution times of Verifire against one commercial equivalence checker assisted by manual comparison points. Results are shown for both Booth and Wallace Tree multipliers.

Our tool is effective in verifying multiplier designs that are modifications (usually for optimization) to standard multipliers like Booth, Wallace Tree and Array multipliers. We used the tool for verifying the Verilog implementation of BISMUL [30], a complicated, modified Booth multiplier. In this case, a Booth multiplier verified by our technique was used as the golden design, and the BISMUL was the target design to be verified. Our tool caught a bug in the Verilog code that appeared while the tool tried to calculate the partial products after the first comparison point. The symbolic expressions obtained after rewriting, for the observed output (product) variable (P), could not be proved equal at the next comparison point by Vprover. The rule correspondence that the tool had established, as well as the previous comparison point, provided an error trace.


8.2 Discussion
We discuss the intuition for why our technique can handle large designs, as opposed to existing techniques, as shown in our experimental results. Our technique is most powerful in the context of multiplier verification. Our tool can efficiently equate two different RT-level multiplier designs of any width. Equivalence checkers that use BDD-based algorithms [25] cannot handle large multipliers, whereas our tool scales gracefully to large, complex multipliers. This is because we represent circuits at a higher, term level, as opposed to the Boolean-level representation used by BDD-based techniques. The gain is due in part to the efficiency afforded at the level of terms and in part to our ability to decompose large monolithic designs. For instance, BDDs, which are widely used for verification in equivalence checking, represent circuits at the Boolean function level. This representation is necessarily canonical, and any comparison of two BDDs implies an exhaustive checking of Boolean formulae. This can become unmanageable for complex formulae. We, however, represent circuits at a higher level, where we capture the system behavior as terms. These terms encapsulate the functionality of the circuit at a block/modular level. Therefore, it is easy and intuitive to decompose the terms into smaller subterms, which makes the comparison problem more tractable. A principal reason why our technique gives large gains is the efficient and effective partitioning of the problem. We compute comparison points automatically. Unlike BDDs, the terms need not be compared only in their normal (canonical) form. They can be decomposed into smaller subterms that can be compared at intermediate points. The computation of these intermediate points in our technique is automatic and efficient. The simplification process that performs expression equivalence, although not complete, is extremely efficient for large designs like multipliers. Term rewriting helps verification scale gracefully to large, complex arithmetic circuits. The tradeoff in term rewriting is that the set of rewriting heuristics cannot be complete [10]. Hence, there is a possibility that the rewrite engine has to be modified to incorporate more reductions. However, for most of the space of practical designs, the heuristics necessary for rewriting are already a part of our rewriter, Vprover.
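The incompleteness tradeoff can be made concrete with a toy rewriter. The following Python sketch is not Vprover: the term encoding (nested tuples) and the three rules are hypothetical, chosen only to show how an equivalence that a small rule database cannot establish becomes provable once one more reduction is added.

```python
def rewrite(term, rules):
    """Rewrite bottom-up until no rule in the database applies."""
    if isinstance(term, tuple):
        term = (term[0],) + tuple(rewrite(arg, rules) for arg in term[1:])
    for rule in rules:
        result = rule(term)
        if result is not None:
            return rewrite(result, rules)
    return term


def add_zero(t):
    """x + 0 -> x and 0 + x -> x."""
    if isinstance(t, tuple) and t[0] == "+":
        if t[1] == 0:
            return t[2]
        if t[2] == 0:
            return t[1]
    return None


def shift_zero(t):
    """x << 0 -> x."""
    if isinstance(t, tuple) and t[0] == "<<" and t[2] == 0:
        return t[1]
    return None


def order_operands(t):
    """Commutativity of +, applied as an ordering so that it terminates."""
    if isinstance(t, tuple) and t[0] == "+" and repr(t[1]) > repr(t[2]):
        return ("+", t[2], t[1])
    return None


def proved_equal(lhs, rhs, rules):
    """Two terms are proved equal if they rewrite to the same normal form."""
    return rewrite(lhs, rules) == rewrite(rhs, rules)


reference = ("+", ("+", "product", 0), ("<<", "mcand", 0))
revised = ("+", "mcand", "product")
base_rules = [add_zero, shift_zero]

if __name__ == "__main__":
    print(proved_equal(reference, revised, base_rules))
    print(proved_equal(reference, revised, base_rules + [order_operands]))
```

With only the two simplification rules the terms stay syntactically different and the comparison fails, even though they are equal; adding the ordering rule closes the gap. Even this extended database remains incomplete (it knows nothing about associativity, for example), which mirrors the situation described above.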


Chapter 9 Conclusions

Our tool is dedicated to arithmetic circuit verification. A comparison to Binary Moment Diagrams [4], a technique that was established for multiplier verification, is called for. Although BMDs are more effective than other model checking techniques, our technique achieves significantly more, since we automatically compute comparison/matching points, which yields substantial savings. Since we do not deal with intermediate states in the huge state space of multipliers, but use structural reductions of their implementation to arrive at the comparison points, we can automate this process. A major advantage of our technique is that our tool accepts synthesizable Verilog as its input. Therefore, we do not abstract out any implementation details, as many abstraction techniques in higher level verification do. We have managed to use our technique effectively to verify the entire datapath of microprocessors (adders, shifters, comparators, multipliers). We plan to extend our tool to incorporate sequential circuits that can handle pipelining, so that we can verify the control paths of microprocessors.

The disadvantage of term rewriting is that the Vprover part of the tool, which implements the reduce() function introduced in Chapter 2, is incomplete. The reduce function uses a database of rules to simplify the expressions it is comparing. This database may require additional rules to simplify new expressions. However, this incompleteness is traded for the efficiency of the tool. Also, we have tried to incorporate a large number of rules that were needed to simplify the expressions that we encountered in the circuits we have targeted. In its current state, Vprover is very efficient for practical designs.

Our technique is similar in spirit to a directed theorem proving approach. However, our technique requires much less user expertise and ingenuity than theorem provers [7], [17], [18], [23]. Our tool is a dedicated arithmetic circuit checker, and can be interfaced with equivalence checkers for arithmetic circuit verification. Another possibility is to integrate our tool with the theorem prover ACL2 [19], so that we can leverage the existing RTL library in ACL2 [9], [22] to add rules to Vprover in a sound manner. Toward this goal, we have implemented our technique for the verification of a RISC pipeline in Appendix A. Our technique is a step toward verification of two generic arithmetic circuits. We have managed to verify a large number of arithmetic circuits using our technique, like adders, shifters, and comparators. This technique can tackle a large part of the multiplier space, and many of the multipliers currently in use.


9.1 Future Work
We plan to extend this technique to SRT division circuits and floating point arithmetic circuits. We hope to be able to extend the technique to verify any combinational circuit at the RT-level. We would like to interface our rule-based rewriter, Vprover, with the ACL2 rewriting engine, to leverage the large body of work that has been done on improving that engine. We would also like to interface our tool, Verifire, with state-of-the-art equivalence checkers, to provide a complementary, efficient verification environment. Our tool can handle the complicated designs at the RT-level, and use the technology of Boolean level engines to automatically verify designs in their application domain. In the future, we also plan to accomplish property-based verification using our technique and to verify the entire datapath of microprocessors.


Appendices


Appendix A An ACL2 Implementation of our Technique

A.1 Project Description
This project involves the implementation of a new verification technique in ACL2, and its application in verifying real designs. There are two parts of the project:

• Verification of 16 bit arithmetic operations of 74181 ALUs cascaded with a 74182 carry lookahead generator.
• Verification of a RISC pipeline that uses the 74181 ALU as its execution unit.

Section A.2 describes the modeling of the technique in the ACL2 environment. Section A.3 gives the actual ACL2 descriptions used in applying the technique to the 16 bit verification of the 74181 addition operation. Section A.4 describes the RISC pipeline in ACL2.

A.2 Verification of the 74181 ALU in ACL2

A.2.1 Using ACL2
The technique outlined in Chapter 2 is used within the ACL2 environment. The designs are modeled as function definitions in ACL2. We have seen that there are two types of rewriting performed in the technique. One type of rewriting is a part of the design itself and generates the comparable terms. The other is the rewriting that is necessary to prove expression equivalence at every comparison point. Modeling terms as ACL2 functions incorporates the first, internal rewriting. Comparison points are given externally to ACL2, and an ACL2 lemma is generated to check the equivalence of the two designs at each of these points. The proof of these lemmas forms the second type of rewriting described above. The main theorem is to prove that the expression equivalence holds at every comparison point. We have used ACL2 to prove the expression equivalence.

A.2.2 The 74181 ALU
The 74181 ALU [15] is a 4 bit ALU that performs 32 different arithmetic and logical operations. Since the technique is most effective for arithmetic circuits, we have verified only the arithmetic operations of the 74181 IC. The ACL2 description of the 74181 was obtained from the literature. The arithmetic operations verified are addition, subtraction, increment, and decrement. Each operation is selected by assigning the corresponding values to the mode and select bits of the 74181. For instance, the addition operation can be invoked by the following function.

(defun f74181-adder (cin a0 a1 a2 a3 b0 b1 b2 b3)
  (let* ((outs (f74181 (not cin) a0 a1 a2 a3 b0 b1 b2 b3
                       nil t nil nil t))
         (out1 (nth 0 outs))
         (out2 (nth 1 outs))
         (out3 (nth 2 outs))
         (out4 (nth 3 outs))
         (out5 (nth 4 outs)))
    (list out1 out2 out3 out4 (not out5))))

A ripple carry adder is used as the reference model. The ripple carry adder is first proved to function correctly against the + operator in ACL2. The ACL2 description of the gate level ripple carry adder design is as follows.

(defun carry (a b c)
  (if a (or b c) (and b c)))

(defun rca1bit (cin a b)
  (list (xor3 a b cin) (carry a b cin)))

(defun rca4bit (cin a0 a1 a2 a3 b0 b1 b2 b3)
  (let* ((state0 (rca1bit cin a0 b0))
         (s0 (car state0))
         (c1 (cadr state0))
         (state1 (rca1bit c1 a1 b1))
         (s1 (car state1))
         (c2 (cadr state1))
         (state2 (rca1bit c2 a2 b2))
         (s2 (car state2))
         (c3 (cadr state2))
         (state3 (rca1bit c3 a3 b3))
         (s3 (car state3))
         (cout (cadr state3)))
    (list s0 s1 s2 s3 cout)))
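As an informal cross-check of the reference model, the same gate level structure can be transcribed into Python and compared exhaustively against integer addition. This is only a sketch: it is not the ACL2 proof, and `xor3` (whose ACL2 definition is not listed above) is assumed here to be a three-input exclusive-or.

```python
from itertools import product


def carry(a, b, c):
    """Mirror of the ACL2 carry function: majority of a, b, c."""
    return (b or c) if a else (b and c)


def xor3(a, b, c):
    """ASSUMPTION: xor3 is a three-input exclusive-or."""
    return (a != b) != c


def rca1bit(cin, a, b):
    return xor3(a, b, cin), carry(a, b, cin)


def rca4bit(cin, a_bits, b_bits):
    """Ripple the carry through four one-bit stages, bit 0 first."""
    sums, c = [], cin
    for a, b in zip(a_bits, b_bits):
        s, c = rca1bit(c, a, b)
        sums.append(s)
    return sums + [c]


def to_int(bits):
    """Interpret a little-endian sequence of Booleans as an integer."""
    return sum(int(bit) << i for i, bit in enumerate(bits))


def check_rca4bit():
    """Compare rca4bit with integer + on all 2**9 input combinations."""
    checked = 0
    for bits in product([False, True], repeat=9):
        cin, a_bits, b_bits = bits[0], bits[1:5], bits[5:9]
        out = rca4bit(cin, a_bits, b_bits)
        assert to_int(out) == to_int(a_bits) + to_int(b_bits) + int(cin)
        checked += 1
    return checked


if __name__ == "__main__":
    print(check_rca4bit(), "input combinations agree with +")
```

Exhaustive enumeration is feasible here only because the block has nine inputs; the ACL2 proof establishes the same fact symbolically.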

The 74181 can be cascaded using the SN74182 carry lookahead generator to form a 16 bit ALU. The verification of the 16 bit adder operation of 74181 is described in the next section.

A.3 Applying the technique to 16 bit adder operation of 74181 in ACL2
The revised design is the cascaded 74181 ICs functioning as a 16 bit adder. The ACL2 description is given below.

(defun 16bit74181 (cin a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15
                       b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15)
  (let* ((P0 (nth 5 (f74181-adder cin a0 a1 a2 a3 b0 b1 b2 b3)))
         (P1 (nth 5 (f74181-adder nil a4 a5 a6 a7 b4 b5 b6 b7)))
         (P2 (nth 5 (f74181-adder nil a8 a9 a10 a11 b8 b9 b10 b11)))
         (P3 (nth 5 (f74181-adder nil a12 a13 a14 a15 b12 b13 b14 b15)))
         (G0 (nth 6 (f74181-adder cin a0 a1 a2 a3 b0 b1 b2 b3)))
         (G1 (nth 6 (f74181-adder nil a4 a5 a6 a7 b4 b5 b6 b7)))
         (G2 (nth 6 (f74181-adder nil a8 a9 a10 a11 b8 b9 b10 b11)))
         (G3 (nth 6 (f74181-adder nil a12 a13 a14 a15 b12 b13 b14 b15)))
         (pglist (list cin P0 P1 P2 P3 G0 G1 G2 G3))
         (carrylist (sn74182 pglist))
         (c3 (nth 0 carrylist))
         (c7 (nth 1 carrylist))
         (c11 (nth 2 carrylist))
         (adder1 (f74181-adder cin a0 a1 a2 a3 b0 b1 b2 b3))
         (adder2 (f74181-adder c3 a4 a5 a6 a7 b4 b5 b6 b7))
         (adder3 (f74181-adder c7 a8 a9 a10 a11 b8 b9 b10 b11))
         (adder4 (f74181-adder c11 a12 a13 a14 a15 b12 b13 b14 b15)))
    (list (nth 0 adder1) (nth 1 adder1) (nth 2 adder1) (nth 3 adder1)
          (nth 0 adder2) (nth 1 adder2) (nth 2 adder2) (nth 3 adder2)
          (nth 0 adder3) (nth 1 adder3) (nth 2 adder3) (nth 3 adder3)
          (nth 0 adder4) (nth 1 adder4) (nth 2 adder4) (nth 3 adder4)
          (nth 4 adder4))))

The observation function for the golden (reference) model is defined as follows. In the case of the adder, it is the sum output at every comparison point.

(defun obs_g (m cin a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15
                b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15)
  (let* ((outputs (rca16bit cin a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11
                            a12 a13 a14 a15 b0 b1 b2 b3 b4 b5 b6 b7 b8
                            b9 b10 b11 b12 b13 b14 b15))
         (out0 (nth 0 outputs)) (out1 (nth 1 outputs))
         (out2 (nth 2 outputs)) (out3 (nth 3 outputs))
         (out4 (nth 4 outputs)) (out5 (nth 5 outputs))
         (out6 (nth 6 outputs)) (out7 (nth 7 outputs))
         (out8 (nth 8 outputs)) (out9 (nth 9 outputs))
         (out10 (nth 10 outputs)) (out11 (nth 11 outputs))
         (out12 (nth 12 outputs)) (out13 (nth 13 outputs))
         (out14 (nth 14 outputs)) (out15 (nth 15 outputs)))
    (cond ((equal m 1) (list out0 out1 out2 out3))
          ((equal m 2) (list out4 out5 out6 out7))
          ((equal m 3) (list out8 out9 out10 out11))
          ((equal m 4) (list out12 out13 out14 out15)))))

Similarly, the observation function for the revised design is defined as follows.

(defun obs_r (m cin a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15
                b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15)
  (let* ((outputs (16bit74181 cin a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11
                              a12 a13 a14 a15 b0 b1 b2 b3 b4 b5 b6 b7 b8
                              b9 b10 b11 b12 b13 b14 b15))
         (out0 (nth 0 outputs)) (out1 (nth 1 outputs))
         (out2 (nth 2 outputs)) (out3 (nth 3 outputs))
         (out4 (nth 4 outputs)) (out5 (nth 5 outputs))
         (out6 (nth 6 outputs)) (out7 (nth 7 outputs))
         (out8 (nth 8 outputs)) (out9 (nth 9 outputs))
         (out10 (nth 10 outputs)) (out11 (nth 11 outputs))
         (out12 (nth 12 outputs)) (out13 (nth 13 outputs))
         (out14 (nth 14 outputs)) (out15 (nth 15 outputs)))
    (cond ((equal m 1) (list out0 out1 out2 out3))
          ((equal m 2) (list out4 out5 out6 out7))
          ((equal m 3) (list out8 out9 out10 out11))
          ((equal m 4) (list out12 out13 out14 out15)))))

The conditional statement, in which m takes different values, corresponds to the different comparison points and to the observation function at each comparison point. For instance, at the first comparison point, the first four bits of the sum obtained in both systems are compared. The comparison points are provided to the ACL2 proof engine. In the 74181, the sum of the first four bits is obtained after a single rewriting step. However, in the ripple carry model, the sum of a single bit is obtained at the end of every rewriting step. So, the ripple carry adder is stepped four times, and the sum of the first four bits is compared at the first comparison point. At the last comparison point, the normal form of the two systems is compared, i.e., the last comparison point is at the state where all 16 bits of the sum are obtained. The main theorem states that the two TRSs are equal if the observation functions at every comparison point are proved equal. We used ACL2 to prove the expression equivalence of outputs at each comparison point by proving the following lemma, which establishes the equivalence of four bits of the sum outputs at any comparison point.

(defthm 4biteq
  (implies (and (Booleanp cin)
                (Booleanp a0) (Booleanp a1) (Booleanp a2) (Booleanp a3)
                (Booleanp b0) (Booleanp b1) (Booleanp b2) (Booleanp b3))
           (equal (rca4bit cin a0 a1 a2 a3 b0 b1 b2 b3)
                  (f74181-adder cin a0 a1 a2 a3 b0 b1 b2 b3))))

ACL2 proved both these theorems, thereby establishing the equivalence of the two designs. The other arithmetic operations of the 74181 ALU, namely subtraction, increment, and decrement, were also verified. The ripple carry adder model was used as the reference for these operations as well. The proof procedure is very similar to that described for the addition.
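The role of the comparison points can be illustrated with a small Python model. This is a sketch under stated assumptions, not the ACL2 development: operands are packed into integers, the reference side is a ripple carry adder stepped four times per comparison point, and the revised side is a generic block adder whose block carries come from group propagate/generate signals, used here as a hypothetical stand-in for the 74181/74182 pair (whose ACL2 descriptions are not listed in this appendix).

```python
import random


def rca_step(state, a, b):
    """One rewrite step of the reference model: add bit i, keep the carry."""
    i, carry, sums = state
    total = ((a >> i) & 1) + ((b >> i) & 1) + carry
    return i + 1, total >> 1, sums | ((total & 1) << i)


def obs_g(m, cin, a, b):
    """Reference observation at comparison point m (after 4*m steps)."""
    state = (0, cin, 0)
    for _ in range(4 * m):
        state = rca_step(state, a, b)
    return (state[2] >> (4 * (m - 1))) & 0xF


def block_pg(a4, b4):
    """Group propagate and generate of one 4-bit block."""
    p = [((a4 >> i) | (b4 >> i)) & 1 for i in range(4)]
    g = [((a4 >> i) & (b4 >> i)) & 1 for i in range(4)]
    group_p = p[0] & p[1] & p[2] & p[3]
    group_g = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
               | (p[3] & p[2] & p[1] & g[0]))
    return group_p, group_g


def obs_r(m, cin, a, b):
    """Revised observation: block carry from P/G, then one 4-bit block add."""
    carry = cin
    for k in range(m - 1):
        group_p, group_g = block_pg((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF)
        carry = group_g | (group_p & carry)
    shift = 4 * (m - 1)
    return (((a >> shift) & 0xF) + ((b >> shift) & 0xF) + carry) & 0xF


def check_comparison_points(samples=2000, seed=0):
    """Compare both observations at every comparison point on random inputs."""
    rng = random.Random(seed)
    for _ in range(samples):
        a, b, cin = rng.getrandbits(16), rng.getrandbits(16), rng.getrandbits(1)
        for m in (1, 2, 3, 4):
            assert obs_g(m, cin, a, b) == obs_r(m, cin, a, b)
    return True


if __name__ == "__main__":
    print(check_comparison_points())
```

Random sampling only gives evidence, not a proof; the point of the sketch is the alignment of one revised step with four reference steps at each comparison point.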

A.4 Verification of a RISC pipeline using our technique
Our technique can be extended to a full-fledged RISC pipeline verification. To illustrate this, we are currently working on a simple RISC pipeline that consists of four operations per instruction: fetch, decode, execute and write back. The processor system has an instruction memory im, a register file rf and a program counter pc. The execution phase calls the 74181 ALU module to execute the operations. The ACL2 definitions for these four operations are:

(defconst *initialrf*
  (list (list 'r1 nil) (list 'r2 nil) (list 'r3 nil) (list 'r4 nil)
        (list 'r5 nil) (list 'r6 nil) (list 'r7 nil) (list 'r8 nil)))

(defconst *im*
  (list (list 'add 'r1 'r2)
        (list 'sub 'r4 'r5)
        (list 'inc 'r6 'r7)
        (list 'dec 'r8 'r9)))

(defun fetch (pc rf im)
  (let* ((ir (car (car rf))))
    (list (+ 1 pc)
          (list (list ir (nth pc im)) (cdr rf))
          im
          nil)))

(defun decode (pc rf im)
  (let* ((fetchedi (car (cadr (car rf)))))
    (cond ((equal fetchedi 'add)
           (list pc rf im (list t nil t nil nil t) nil))
          ((equal fetchedi 'sub)
           (list pc rf im (list nil nil nil t t nil) nil))
          ((equal fetchedi 'inc)
           (list pc rf im (list nil nil t t t t) nil))
          ((equal fetchedi 'dec)
           (list pc rf im (list t nil nil nil nil nil) nil)))))

(defun excecute (pc rf im cntrl)
  (let* ((reg1 (cond ((equal (cadr (cadr (nth 0 rf))) 'r2) (cadr (nth 1 rf)))
                     ((equal (cadr (cadr (nth 0 rf))) 'r3) (cadr (nth 2 rf)))
                     ((equal (cadr (cadr (nth 0 rf))) 'r4) (cadr (nth 3 rf)))
                     ((equal (cadr (cadr (nth 0 rf))) 'r5) (cadr (nth 4 rf)))
                     ((equal (cadr (cadr (nth 0 rf))) 'r6) (cadr (nth 5 rf)))
                     ((equal (cadr (cadr (nth 0 rf))) 'r7) (cadr (nth 6 rf)))
                     ((equal (cadr (cadr (nth 0 rf))) 'r8) (cadr (nth 7 rf)))
                     ((equal (cadr (cadr (nth 0 rf))) 'r9) (cadr (nth 8 rf)))))
         (reg2 (cond ((equal (caddr (cadr (nth 0 rf))) 'r2) (cdr (nth 1 rf)))
                     ((equal (caddr (cadr (nth 0 rf))) 'r3) (cdr (nth 2 rf)))
                     ((equal (caddr (cadr (nth 0 rf))) 'r4) (cdr (nth 3 rf)))
                     ((equal (caddr (cadr (nth 0 rf))) 'r5) (cdr (nth 4 rf)))
                     ((equal (caddr (cadr (nth 0 rf))) 'r6) (cdr (nth 5 rf)))
                     ((equal (caddr (cadr (nth 0 rf))) 'r7) (cdr (nth 6 rf)))
                     ((equal (caddr (cadr (nth 0 rf))) 'r8) (cdr (nth 7 rf)))
                     ((equal (caddr (cadr (nth 0 rf))) 'r9) (cdr (nth 8 rf))))))
    (list pc rf im
          (f74181 (nth 0 cntrl)
                  (nth 0 reg1) (nth 1 reg1) (nth 2 reg1) (nth 3 reg1)
                  (nth 0 reg2) (nth 1 reg2) (nth 2 reg2) (nth 3 reg2)
                  (nth 1 cntrl) (nth 2 cntrl) (nth 3 cntrl)
                  (nth 4 cntrl) (nth 5 cntrl)))))

(defun writeback (pc rf im retval)
  (let* ((reg2 (caddr (cadr (nth 0 rf))))
         (out0 (nth 0 retval))
         (out1 (nth 1 retval))
         (out2 (nth 2 retval))
         (out3 (nth 3 retval)))
    (put-assoc-eq reg2 (list out0 out1 out2 out3) rf)))

The reference design in this case can be a non-pipelined machine system that takes four machine cycles to process a single instruction. The comparison point for both these machine systems will be after 4 instructions are executed in the two systems. This takes 16 machine cycles for the non-pipelined system, as opposed to 7 cycles in the pipelined machine system. We are working on the proof of equivalence of these two machines.


Appendix B Verilog Code for the Shift-and-Add Multiplier

The Verilog code for a 16-bit Shift-and-Add multiplier is described here. The shfadd_16bit module is the main module, which in turn refers to a 32-bit full adder (serialadd_32bit) and a 32-bit multiplexer (mux_32bit).

module fulladder (a, b, c, x, y);
  input a, b, c;
  output x, y;
  wire w_x, w_y;

  assign x = w_x;
  assign y = w_y;

  // Sum: x = a XOR b XOR c, built from inverters and NAND gates
  not (a_, a);
  not (b_, b);
  nand (an, a_, b);
  nand (bn, a, b_);
  nand (axb, an, bn);
  not (c_, c);
  not (axb_, axb);
  nand (axbn, axb_, c);
  nand (cn, c_, axb);
  nand (w_x, axbn, cn);

  // Carry: y = ab + ac + bc
  nand (anb, a, b);
  nand (anc, a, c);
  nand (bnc, b, c);
  and (anbanc, anb, anc);
  nand (w_y, anbanc, bnc);
endmodule // fulladder
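The gate network of the fulladder module can be sanity-checked by transcribing it gate for gate and enumerating its eight input combinations. The Python sketch below does this; the helper names `nand` and `inv` are illustrative and the net names follow the Verilog listing.

```python
from itertools import product


def nand(x, y):
    return 1 - (x & y)


def inv(x):
    return 1 - x


def fulladder(a, b, c):
    """Gate-for-gate transcription of the Verilog fulladder module."""
    a_, b_, c_ = inv(a), inv(b), inv(c)
    an, bn = nand(a_, b), nand(a, b_)
    axb = nand(an, bn)                      # a XOR b
    axbn, cn = nand(inv(axb), c), nand(c_, axb)
    x = nand(axbn, cn)                      # sum = a XOR b XOR c
    anb, anc, bnc = nand(a, b), nand(a, c), nand(b, c)
    y = nand(anb & anc, bnc)                # carry = ab + ac + bc
    return x, y


def check_fulladder():
    """The outputs, read as a 2-bit number, must equal a + b + c."""
    for a, b, c in product([0, 1], repeat=3):
        x, y = fulladder(a, b, c)
        assert x + 2 * y == a + b + c
    return True


if __name__ == "__main__":
    print(check_fulladder())
```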

module serialadd_32bit (a, b, cin, s, cout);
  input [31:0] a, b;
  input cin;
  output [31:0] s;
  output cout;

  wire [30:0] carry;

fulladder (a[0], b[0], cin, s[0], carry[0]); fulladder (a[1], b[1], carry[0], s[1], carry[1]); fulladder (a[2], b[2], carry[1], s[2], carry[2]); fulladder (a[3], b[3], carry[2], s[3], carry[3]);


fulladder (a[4], b[4], carry[3], s[4], carry[4]); fulladder (a[5], b[5], carry[4], s[5], carry[5]); fulladder (a[6], b[6], carry[5], s[6], carry[6]); fulladder (a[7], b[7], carry[6], s[7], carry[7]); fulladder (a[8], b[8], carry[7], s[8], carry[8]); fulladder (a[9], b[9], carry[8], s[9], carry[9]); fulladder (a[10], b[10], carry[9], s[10], carry[10]); fulladder (a[11], b[11], carry[10], s[11], carry[11]); fulladder (a[12], b[12], carry[11], s[12], carry[12]); fulladder (a[13], b[13], carry[12], s[13], carry[13]); fulladder (a[14], b[14], carry[13], s[14], carry[14]); fulladder (a[15], b[15], carry[14], s[15], carry[15]); fulladder (a[16], b[16], carry[15], s[16], carry[16]); fulladder (a[17], b[17], carry[16], s[17], carry[17]); fulladder (a[18], b[18], carry[17], s[18], carry[18]); fulladder (a[19], b[19], carry[18], s[19], carry[19]); fulladder (a[20], b[20], carry[19], s[20], carry[20]); fulladder (a[21], b[21], carry[20], s[21], carry[21]); fulladder (a[22], b[22], carry[21], s[22], carry[22]); fulladder (a[23], b[23], carry[22], s[23], carry[23]); fulladder (a[24], b[24], carry[23], s[24], carry[24]); fulladder (a[25], b[25], carry[24], s[25], carry[25]);


fulladder (a[26], b[26], carry[25], s[26], carry[26]); fulladder (a[27], b[27], carry[26], s[27], carry[27]); fulladder (a[28], b[28], carry[27], s[28], carry[28]); fulladder (a[29], b[29], carry[28], s[29], carry[29]); fulladder (a[30], b[30], carry[29], s[30], carry[30]); fulladder (a[31], b[31], carry[30], s[31], cout); endmodule // serialadd_32bit

module mux2to1 (a, b, s, o); input a, b, s;

output o; wire s_n, as, bs;

not (s_n, s); and (as, a, s); and (bs, b, s_n); or (o, as, bs); endmodule // mux2to1

module mux_32bit (select, in, out); input select;

input [31:0] in; output [31:0] out;


mux2to1 (in[0], 1’b0, select, out[0]); mux2to1 (in[1], 1’b0, select, out[1]); mux2to1 (in[2], 1’b0, select, out[2]); mux2to1 (in[3], 1’b0, select, out[3]); mux2to1 (in[4], 1’b0, select, out[4]); mux2to1 (in[5], 1’b0, select, out[5]); mux2to1 (in[6], 1’b0, select, out[6]); mux2to1 (in[7], 1’b0, select, out[7]); mux2to1 (in[8], 1’b0, select, out[8]); mux2to1 (in[9], 1’b0, select, out[9]); mux2to1 (in[10], 1’b0, select, out[10]); mux2to1 (in[11], 1’b0, select, out[11]); mux2to1 (in[12], 1’b0, select, out[12]); mux2to1 (in[13], 1’b0, select, out[13]); mux2to1 (in[14], 1’b0, select, out[14]); mux2to1 (in[15], 1’b0, select, out[15]); mux2to1 (in[16], 1’b0, select, out[16]); mux2to1 (in[17], 1’b0, select, out[17]); mux2to1 (in[18], 1’b0, select, out[18]); mux2to1 (in[19], 1’b0, select, out[19]); mux2to1 (in[20], 1’b0, select, out[20]); mux2to1 (in[21], 1’b0, select, out[21]);


mux2to1 (in[22], 1’b0, select, out[22]); mux2to1 (in[23], 1’b0, select, out[23]); mux2to1 (in[24], 1’b0, select, out[24]); mux2to1 (in[25], 1’b0, select, out[25]); mux2to1 (in[26], 1’b0, select, out[26]); mux2to1 (in[27], 1’b0, select, out[27]); mux2to1 (in[28], 1’b0, select, out[28]); mux2to1 (in[29], 1’b0, select, out[29]); mux2to1 (in[30], 1’b0, select, out[30]); mux2to1 (in[31], 1’b0, select, out[31]); endmodule // mux_32bit

module shfadd_16bit (A, B, P);
  input [15:0] A, B;
  output [31:0] P;

  wire [31:0] p_0, p_1, p_2, p_3, p_4, p_5, p_6, p_7;
  wire [31:0] p_8, p_9, p_10, p_11, p_12, p_13, p_14, p_15;
  wire cout;
  wire [31:0] pp1, pp2, pp3, pp4, pp5, pp6, pp7, pp8;
  wire [31:0] pp9, pp10, pp11, pp12, pp13, pp14, pp15;

  assign P = p_15;

mux_32bit (A[0], {16’b0, B}, p_0);

mux_32bit (A[1], {15’b0, B, 1’b0}, pp1); serialadd_32bit (p_0, pp1, 1’b0, p_1, cout);

mux_32bit (A[2], {14’b0, B, 2’b0}, pp2); serialadd_32bit (p_1, pp2, 1’b0, p_2, cout);

mux_32bit (A[3], {13’b0, B, 3’b0}, pp3); serialadd_32bit (p_2, pp3, 1’b0, p_3, cout);

mux_32bit (A[4], {12’b0, B, 4’b0}, pp4); serialadd_32bit (p_3, pp4, 1’b0, p_4, cout);

mux_32bit (A[5], {11’b0, B, 5’b0}, pp5); serialadd_32bit (p_4, pp5, 1’b0, p_5, cout);

mux_32bit (A[6], {10’b0, B, 6’b0}, pp6); serialadd_32bit (p_5, pp6, 1’b0, p_6, cout);

mux_32bit (A[7], {9’b0, B, 7’b0}, pp7);


serialadd_32bit (p_6, pp7, 1’b0, p_7, cout);

mux_32bit (A[8], {8’b0, B, 8’b0}, pp8); serialadd_32bit (p_7, pp8, 1’b0, p_8, cout);

mux_32bit (A[9], {7’b0, B, 9’b0}, pp9); serialadd_32bit (p_8, pp9, 1’b0, p_9, cout);

mux_32bit (A[10], {6’b0, B, 10’b0}, pp10); serialadd_32bit (p_9, pp10, 1’b0, p_10, cout);

mux_32bit (A[11], {5’b0, B, 11’b0}, pp11); serialadd_32bit (p_10, pp11, 1’b0, p_11, cout);

mux_32bit (A[12], {4’b0, B, 12’b0}, pp12); serialadd_32bit (p_11, pp12, 1’b0, p_12, cout);

mux_32bit (A[13], {3’b0, B, 13’b0}, pp13); serialadd_32bit (p_12, pp13, 1’b0, p_13, cout);

mux_32bit (A[14], {2’b0, B, 14’b0}, pp14); serialadd_32bit (p_13, pp14, 1’b0, p_14, cout);


mux_32bit (A[15], {1’b0, B, 15’b0}, pp15); serialadd_32bit (p_14, pp15, 1’b0, p_15, cout); endmodule // shfadd_16bit
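The structure of shfadd_16bit (one multiplexer and one 32-bit addition per multiplier bit) can be summarized by a behavioral model. The Python sketch below is an illustration only: it mirrors the module names of the listing but works on integers rather than on gates, and it assumes that mux_32bit passes its input when select is 1 and outputs zero otherwise, as the mux2to1 wiring suggests.

```python
import random

MASK32 = 0xFFFFFFFF


def mux_32bit(select, value):
    """Pass the 32-bit value when select is 1, otherwise output zero."""
    return value if select else 0


def serialadd_32bit(a, b, cin=0):
    """32-bit addition returning (sum, carry out)."""
    total = a + b + cin
    return total & MASK32, total >> 32


def shfadd_16bit(A, B):
    """Accumulate the shifted multiplicand B for every set bit of A."""
    p = mux_32bit(A & 1, B)                                  # p_0
    for i in range(1, 16):
        pp = mux_32bit((A >> i) & 1, (B << i) & MASK32)      # pp_i
        p, _cout = serialadd_32bit(p, pp)                    # p_i
    return p


def check_shfadd(samples=1000, seed=0):
    """Compare the model with integer multiplication on random operands."""
    rng = random.Random(seed)
    for _ in range(samples):
        A, B = rng.getrandbits(16), rng.getrandbits(16)
        assert shfadd_16bit(A, B) == A * B
    return True


if __name__ == "__main__":
    print(check_shfadd())
```

Each loop iteration corresponds to one rewriting step of the Shift-and-Add TRS, which is why an n-bit multiplier needs n steps to reach the comparison point with a monolithic design.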


Appendix C Verilog Code for the Booth Multiplier

The Verilog code for a 16-bit Booth multiplier is described here. The booth_16bit module is the main module, which in turn refers to a 32-bit 8-way multiplexer (mux8way_32bit) and a 32-bit full adder (serialadd_32bit). The full adder code is the same as in Appendix B.

module ppgen (m, pp0, pp1, pp2, pp3, pp4, pp5, pp6, pp7);
  input [31:0] m;
  output [31:0] pp0, pp1, pp2, pp3, pp4, pp5, pp6, pp7;

  wire [31:0] p3, p5, p6, p7;
  wire c;

  assign pp0 = 32'b0;              //pp0
  assign pp1 = m;                  //pp1
  assign pp2 = {m[30:0], 1'b0};    //pp2
  assign pp3 = p3;                 //pp3
  assign pp4 = {m[29:0], 2'b0};    //pp4
  assign pp5 = p5;                 //pp5
  assign pp6 = p6;                 //pp6
  assign pp7 = p7;                 //pp7

  serialadd_32bit ({m[30:0],1'b0}, m, 1'b0, p3, c);
  serialadd_32bit ({m[29:0],2'b0}, m, 1'b0, p5, c);
  serialadd_32bit ({m[30:0],1'b0}, {m[29:0],2'b0}, 1'b0, p6, c);
  serialsub_32bit ({m[28:0],3'b0}, m, p7, c);

endmodule // ppgen
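The intent of ppgen, as read from the listing, is to produce the eight multiples 0·m through 7·m using only shifts, additions, and one subtraction. The Python sketch below checks that reading; it is an illustration, and it assumes that serialsub_32bit (not listed in this appendix) computes the 32-bit difference of its first two operands.

```python
MASK32 = 0xFFFFFFFF


def ppgen(m):
    """Return [pp0, ..., pp7], mirroring the shift/add structure of ppgen."""
    p3 = ((m << 1) + m) & MASK32          # 2m + m
    p5 = ((m << 2) + m) & MASK32          # 4m + m
    p6 = ((m << 1) + (m << 2)) & MASK32   # 2m + 4m
    p7 = ((m << 3) - m) & MASK32          # 8m - m (assumed subtractor)
    return [0, m & MASK32, (m << 1) & MASK32, p3,
            (m << 2) & MASK32, p5, p6, p7]


def check_ppgen(values=(0, 1, 3, 0xFFFF, 0x12345678, 0xFFFFFFFF)):
    """pp_k must equal k * m modulo 2**32 for k = 0..7."""
    for m in values:
        assert ppgen(m) == [(k * m) & MASK32 for k in range(8)]
    return True


if __name__ == "__main__":
    print(check_ppgen())
```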

module mux8way (select, p0, p1, p2, p3, p4, p5, p6, p7, po);
  input [2:0] select;
  input p0, p1, p2, p3, p4, p5, p6, p7;
  output po;

  wire po_10, po_11, po_12, po_13, po_20, po_21;

mux2to1 (p1, p0, select[0], po_10); mux2to1 (p3, p2, select[0], po_11); mux2to1 (p5, p4, select[0], po_12); mux2to1 (p7, p6, select[0], po_13);

mux2to1 (po_11, po_10, select[1], po_20);

82

mux2to1 (po_13, po_12, select[1], po_21);

mux2to1 (po_21, po_20, select[2], po);

endmodule // mux8way

module mux8way_32bit (select, pp0, pp1, pp2, pp3, pp4, pp5, pp6, pp7, ppout);
input [2:0] select;
input [31:0] pp0, pp1, pp2, pp3, pp4, pp5, pp6, pp7;
output [31:0] ppout;

mux8way (select, pp0[0], pp1[0], pp2[0], pp3[0], pp4[0], pp5[0], pp6[0], pp7[0], ppout[0]);
mux8way (select, pp0[1], pp1[1], pp2[1], pp3[1], pp4[1], pp5[1], pp6[1], pp7[1], ppout[1]);
mux8way (select, pp0[2], pp1[2], pp2[2], pp3[2], pp4[2], pp5[2], pp6[2], pp7[2], ppout[2]);
mux8way (select, pp0[3], pp1[3], pp2[3], pp3[3], pp4[3], pp5[3], pp6[3], pp7[3], ppout[3]);
mux8way (select, pp0[4], pp1[4], pp2[4], pp3[4], pp4[4], pp5[4], pp6[4], pp7[4], ppout[4]);
mux8way (select, pp0[5], pp1[5], pp2[5], pp3[5], pp4[5], pp5[5], pp6[5], pp7[5], ppout[5]);
mux8way (select, pp0[6], pp1[6], pp2[6], pp3[6], pp4[6], pp5[6], pp6[6], pp7[6], ppout[6]);
mux8way (select, pp0[7], pp1[7], pp2[7], pp3[7], pp4[7], pp5[7], pp6[7], pp7[7], ppout[7]);
mux8way (select, pp0[8], pp1[8], pp2[8], pp3[8], pp4[8], pp5[8], pp6[8], pp7[8], ppout[8]);
mux8way (select, pp0[9], pp1[9], pp2[9], pp3[9], pp4[9], pp5[9], pp6[9], pp7[9], ppout[9]);
mux8way (select, pp0[10], pp1[10], pp2[10], pp3[10], pp4[10], pp5[10], pp6[10], pp7[10], ppout[10]);
mux8way (select, pp0[11], pp1[11], pp2[11], pp3[11], pp4[11], pp5[11], pp6[11], pp7[11], ppout[11]);
mux8way (select, pp0[12], pp1[12], pp2[12], pp3[12], pp4[12], pp5[12], pp6[12], pp7[12], ppout[12]);
mux8way (select, pp0[13], pp1[13], pp2[13], pp3[13], pp4[13], pp5[13], pp6[13], pp7[13], ppout[13]);
mux8way (select, pp0[14], pp1[14], pp2[14], pp3[14], pp4[14], pp5[14], pp6[14], pp7[14], ppout[14]);
mux8way (select, pp0[15], pp1[15], pp2[15], pp3[15], pp4[15], pp5[15], pp6[15], pp7[15], ppout[15]);
mux8way (select, pp0[16], pp1[16], pp2[16], pp3[16], pp4[16], pp5[16], pp6[16], pp7[16], ppout[16]);
mux8way (select, pp0[17], pp1[17], pp2[17], pp3[17], pp4[17], pp5[17], pp6[17], pp7[17], ppout[17]);
mux8way (select, pp0[18], pp1[18], pp2[18], pp3[18], pp4[18], pp5[18], pp6[18], pp7[18], ppout[18]);
mux8way (select, pp0[19], pp1[19], pp2[19], pp3[19], pp4[19], pp5[19], pp6[19], pp7[19], ppout[19]);
mux8way (select, pp0[20], pp1[20], pp2[20], pp3[20], pp4[20], pp5[20], pp6[20], pp7[20], ppout[20]);
mux8way (select, pp0[21], pp1[21], pp2[21], pp3[21], pp4[21], pp5[21], pp6[21], pp7[21], ppout[21]);
mux8way (select, pp0[22], pp1[22], pp2[22], pp3[22], pp4[22], pp5[22], pp6[22], pp7[22], ppout[22]);
mux8way (select, pp0[23], pp1[23], pp2[23], pp3[23], pp4[23], pp5[23], pp6[23], pp7[23], ppout[23]);
mux8way (select, pp0[24], pp1[24], pp2[24], pp3[24], pp4[24], pp5[24], pp6[24], pp7[24], ppout[24]);
mux8way (select, pp0[25], pp1[25], pp2[25], pp3[25], pp4[25], pp5[25], pp6[25], pp7[25], ppout[25]);
mux8way (select, pp0[26], pp1[26], pp2[26], pp3[26], pp4[26], pp5[26], pp6[26], pp7[26], ppout[26]);
mux8way (select, pp0[27], pp1[27], pp2[27], pp3[27], pp4[27], pp5[27], pp6[27], pp7[27], ppout[27]);
mux8way (select, pp0[28], pp1[28], pp2[28], pp3[28], pp4[28], pp5[28], pp6[28], pp7[28], ppout[28]);
mux8way (select, pp0[29], pp1[29], pp2[29], pp3[29], pp4[29], pp5[29], pp6[29], pp7[29], ppout[29]);
mux8way (select, pp0[30], pp1[30], pp2[30], pp3[30], pp4[30], pp5[30], pp6[30], pp7[30], ppout[30]);
mux8way (select, pp0[31], pp1[31], pp2[31], pp3[31], pp4[31], pp5[31], pp6[31], pp7[31], ppout[31]);

endmodule // mux8way_32bit

module booth_16bit (A, B, P);
input [15:0] A, B;
output [31:0] P;

wire [31:0] pp10, pp11, pp12, pp13, pp14, pp15, pp16, pp17;
wire [31:0] pp20, pp21, pp22, pp23, pp24, pp25, pp26, pp27;
wire [31:0] pp30, pp31, pp32, pp33, pp34, pp35, pp36, pp37;
wire [31:0] pp40, pp41, pp42, pp43, pp44, pp45, pp46, pp47;
wire [31:0] pp50, pp51, pp52, pp53, pp54, pp55, pp56, pp57;
wire [31:0] pp60, pp61, pp62, pp63, pp64, pp65, pp66, pp67;
wire [31:0] p, pp, pp1, pp2, pp3, ppout1, ppout2;
wire [31:0] ppout3, ppout4, ppout5, ppout6;

assign P = p;

ppgen ({16'b0,A}, pp10, pp11, pp12, pp13, pp14, pp15, pp16, pp17);
mux8way_32bit (B[2:0], pp10, pp11, pp12, pp13, pp14, pp15, pp16, pp17, ppout1);

ppgen ({13'b0,A,3'b0}, pp20, pp21, pp22, pp23, pp24, pp25, pp26, pp27);
mux8way_32bit (B[5:3], pp20, pp21, pp22, pp23, pp24, pp25, pp26, pp27, ppout2);
serialadd_32bit (ppout1, ppout2, 1'b0, pp, c);

ppgen ({10'b0,A,6'b0}, pp30, pp31, pp32, pp33, pp34, pp35, pp36, pp37);
mux8way_32bit (B[8:6], pp30, pp31, pp32, pp33, pp34, pp35, pp36, pp37, ppout3);
serialadd_32bit (pp, ppout3, 1'b0, pp1, c);

ppgen ({7'b0,A,9'b0}, pp40, pp41, pp42, pp43, pp44, pp45, pp46, pp47);
mux8way_32bit (B[11:9], pp40, pp41, pp42, pp43, pp44, pp45, pp46, pp47, ppout4);
serialadd_32bit (pp1, ppout4, 1'b0, pp2, c);

ppgen ({4'b0,A,12'b0}, pp50, pp51, pp52, pp53, pp54, pp55, pp56, pp57);
mux8way_32bit (B[14:12], pp50, pp51, pp52, pp53, pp54, pp55, pp56, pp57, ppout5);
serialadd_32bit (pp2, ppout5, 1'b0, pp3, c);

ppgen ({1'b0,A,15'b0}, pp60, pp61, pp62, pp63, pp64, pp65, pp66, pp67);
mux8way_32bit ({2'b0,B[15]}, pp60, pp61, pp62, pp63, pp64, pp65, pp66, pp67, ppout6);
serialadd_32bit (pp3, ppout6, 1'b0, p, c);

endmodule // booth_16bit
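The structure of booth_16bit suggests a radix-8 scheme: B is cut into 3-bit groups, each group selects a multiple of A shifted left by three more positions than the previous one, and the selected terms are summed. A minimal behavioural sketch of that reading follows, together with a small self-checking testbench. The names booth_16bit_ref and tb_booth_16bit_ref are hypothetical and do not appear in the thesis. The sketch assumes that mux8way_32bit selects the input pp<k> when select equals k.

```verilog
// Hypothetical behavioural model of the radix-8 accumulation
// performed structurally by booth_16bit (illustration only).
module booth_16bit_ref (A, B, P);
  input  [15:0] A, B;
  output [31:0] P;
  reg    [31:0] P;
  reg    [31:0] acc;
  reg    [17:0] Bext;   // B zero-extended to six 3-bit groups
  reg    [2:0]  digit;  // current 3-bit group of B
  integer i;

  always @(A or B) begin
    acc  = 32'b0;
    Bext = {2'b0, B};
    for (i = 0; i < 6; i = i + 1) begin
      digit = Bext[3*i +: 3];                       // Verilog-2001 indexed part-select
      acc   = acc + (({16'b0, A} * digit) << (3*i)); // add digit * A * 8^i
    end
    P = acc;
  end
endmodule

// Compares the model against the built-in multiply on random operands.
module tb_booth_16bit_ref;
  reg  [15:0] A, B;
  wire [31:0] P;
  integer k, errors;

  booth_16bit_ref dut (A, B, P);

  initial begin
    errors = 0;
    for (k = 0; k < 1000; k = k + 1) begin
      A = $random;
      B = $random;
      #1;
      if (P !== A * B) errors = errors + 1;
    end
    $display("mismatches: %0d", errors);
    $finish;
  end
endmodule
```

The same testbench could be pointed at the structural booth_16bit, provided the supporting modules from Appendix B are compiled alongside it.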


Appendix D Verilog Code for the BISMUL Multiplier

The Verilog code for a 64-bit BISMUL multiplier is described here. mul_radix8$ is the main multiplier module. In turn, it refers to a 73-bit carry save adder (csa_5op$), a 16-bit multiplexer along with shift by 1 bit (mplier_shft16b$), and a partial product generator (ppreg$).

module ppreg$ (out, mul, iclk);
input [63:0] mul;
input iclk;
output [669:0] out;

wire set0=1'b0;
wire set1=1'b1;
wire [66:0] pp0, pp1, pp2, pp3, pp4, pp5, pp6, pp7;
wire [66:0] x3mul, x5mul, x7mul, pp8, ppn1;
wire [66:0] p4_temp;

assign pp1= out[133:67];
assign pp2= out[200:134];
assign pp3= out[267:201];
assign pp4= out[334:268];
assign pp5= out[401:335];
assign pp6= out[468:402];
assign pp7= out[535:469];
assign pp8= out[602:536];
assign ppn1= out[669:603];

samp_hold67$ sh1 (out[133:67], {3'b0, mul}, iclk);
samp_hold67$ sh2 (out[200:134], {2'b0, mul, 1'b0}, iclk);
latch67$ lat3 (out[267:201], x3mul, iclk);
samp_hold67$ sh4 (out[334:268], {1'b0, mul, 2'b0}, iclk);
latch67$ lat5 (out[401:335], x5mul, iclk);
latch67$ lat6 (out[468:402], {x3mul[65:0],1'b0}, iclk);
latch67$ lat7 (out[535:469], x7mul, iclk);
inv67$ invm1 (out[669:603], {3'b0, mul});

adder67b$ adder0 (x3mul, out[133:67], out[200:134], set0);
adder67b$ adder1 (x5mul, out[133:67], out[334:268], set0);
adder67b$ adder2 (x7mul, out[669:603], {mul, 3'b0}, set1);
endmodule


module mplier_shft16b$ (out, in, pen_a, iclk);
input [15:0] in;
input iclk;
input pen_a;
output [17:0] out;
wire set1=1'b1;
wire set0=1'b0;

pareg1b$ pa16 (out[17], in[15], set0, pen_a, set1, iclk);
pareg1b$ pa15 (out[16], in[14], set0, pen_a, set1, iclk);
pareg1b$ pa14 (out[15], in[13], set0, pen_a, set1, iclk);
pareg1b$ pa13 (out[14], in[12], out[17], pen_a, set1, iclk);
pareg1b$ pa12 (out[13], in[11], out[16], pen_a, set1, iclk);
pareg1b$ pa11 (out[12], in[10], out[15], pen_a, set1, iclk);
pareg1b$ pa10 (out[11], in[9], out[14], pen_a, set1, iclk);
pareg1b$ pa9 (out[10], in[8], out[13], pen_a, set1, iclk);
pareg1b$ pa8 (out[9], in[7], out[12], pen_a, set1, iclk);
pareg1b$ pa7 (out[8], in[6], out[11], pen_a, set1, iclk);
pareg1b$ pa6 (out[7], in[5], out[10], pen_a, set1, iclk);
pareg1b$ pa5 (out[6], in[4], out[9], pen_a, set1, iclk);
pareg1b$ pa4 (out[5], in[3], out[8], pen_a, set1, iclk);
pareg1b$ pa3 (out[4], in[2], out[7], pen_a, set1, iclk);
pareg1b$ pa2 (out[3], in[1], out[6], pen_a, set1, iclk);
pareg1b$ pa1 (out[2], in[0], out[5], pen_a, set1, iclk);
pareg1b$ pa0 (out[1], set0, out[4], pen_a, set1, iclk);
pareg1b$ pad5 (out[0], set0, out[3], pen_a, set1, iclk);
endmodule

module csa_5op$ (cout, prod, v, w, x, y, z);
input [75:0] v;
input [75:0] w;
input [75:0] x;
input [75:0] y;
input [75:0] z;
output [75:0] prod;
output cout;
wire [75:0] aop, bop;

ha$ ha0 (prod[0], c1_1, v[0], w[0]);
ha$ ha1 (s1_1, c1_2, v[1], w[1]);
ha$ ha2 (s1_2, c1_3, v[2], w[2]);
fa$ fa0 (s1_3, c1_4, v[3], w[3], x[3]);
fa$ fa1 (s1_4, c1_5, v[4], w[4], x[4]);
fa$ fa2 (s1_5, c1_6, v[5], w[5], x[5]);
fa$ fa3 (s1_6, c1_7, v[6], w[6], x[6]);
fa$ fa4 (s1_7, c1_8, v[7], w[7], x[7]);
fa$ fa5 (s1_8, c1_9, v[8], w[8], x[8]);
fa$ fa6 (s1_9, c1_10, v[9], w[9], x[9]);
fa$ fa7 (s1_10, c1_11, v[10], w[10], x[10]);
fa$ fa8 (s1_11, c1_12, v[11], w[11], x[11]);
fa$ fa9 (s1_12, c1_13, v[12], w[12], x[12]);
fa$ fa10 (s1_13, c1_14, v[13], w[13], x[13]);
fa$ fa11 (s1_14, c1_15, v[14], w[14], x[14]);
fa$ fa12 (s1_15, c1_16, v[15], w[15], x[15]);
fa$ fa13 (s1_16, c1_17, v[16], w[16], x[16]);
fa$ fa14 (s1_17, c1_18, v[17], w[17], x[17]);
fa$ fa15 (s1_18, c1_19, v[18], w[18], x[18]);
fa$ fa16 (s1_19, c1_20, v[19], w[19], x[19]);
fa$ fa20 (s1_20, c1_21, v[20], w[20], x[20]);
fa$ fa21 (s1_21, c1_22, v[21], w[21], x[21]);
fa$ fa22 (s1_22, c1_23, v[22], w[22], x[22]);
fa$ fa23 (s1_23, c1_24, v[23], w[23], x[23]);
fa$ fa24 (s1_24, c1_25, v[24], w[24], x[24]);
fa$ fa25 (s1_25, c1_26, v[25], w[25], x[25]);
fa$ fa26 (s1_26, c1_27, v[26], w[26], x[26]);
fa$ fa27 (s1_27, c1_28, v[27], w[27], x[27]);
fa$ fa28 (s1_28, c1_29, v[28], w[28], x[28]);
fa$ fa29 (s1_29, c1_30, v[29], w[29], x[29]);
fa$ fa30 (s1_30, c1_31, v[30], w[30], x[30]);
fa$ fa31 (s1_31, c1_32, v[31], w[31], x[31]);
fa$ fa32 (s1_32, c1_33, v[32], w[32], x[32]);
fa$ fa33 (s1_33, c1_34, v[33], w[33], x[33]);
fa$ fa34 (s1_34, c1_35, v[34], w[34], x[34]);
fa$ fa35 (s1_35, c1_36, v[35], w[35], x[35]);
fa$ fa36 (s1_36, c1_37, v[36], w[36], x[36]);
fa$ fa37 (s1_37, c1_38, v[37], w[37], x[37]);
fa$ fa38 (s1_38, c1_39, v[38], w[38], x[38]);
fa$ fa39 (s1_39, c1_40, v[39], w[39], x[39]);
fa$ fa40 (s1_40, c1_41, v[40], w[40], x[40]);
fa$ fa41 (s1_41, c1_42, v[41], w[41], x[41]);
fa$ fa42 (s1_42, c1_43, v[42], w[42], x[42]);
fa$ fa43 (s1_43, c1_44, v[43], w[43], x[43]);
fa$ fa44 (s1_44, c1_45, v[44], w[44], x[44]);
fa$ fa45 (s1_45, c1_46, v[45], w[45], x[45]);
fa$ fa46 (s1_46, c1_47, v[46], w[46], x[46]);
fa$ fa47 (s1_47, c1_48, v[47], w[47], x[47]);
fa$ fa48 (s1_48, c1_49, v[48], w[48], x[48]);
fa$ fa49 (s1_49, c1_50, v[49], w[49], x[49]);
fa$ fa50 (s1_50, c1_51, v[50], w[50], x[50]);
fa$ fa51 (s1_51, c1_52, v[51], w[51], x[51]);
fa$ fa52 (s1_52, c1_53, v[52], w[52], x[52]);
fa$ fa53 (s1_53, c1_54, v[53], w[53], x[53]);
fa$ fa54 (s1_54, c1_55, v[54], w[54], x[54]);
fa$ fa55 (s1_55, c1_56, v[55], w[55], x[55]);
fa$ fa56 (s1_56, c1_57, v[56], w[56], x[56]);
fa$ fa57 (s1_57, c1_58, v[57], w[57], x[57]);
fa$ fa58 (s1_58, c1_59, v[58], w[58], x[58]);
fa$ fa59 (s1_59, c1_60, v[59], w[59], x[59]);
fa$ fa60 (s1_60, c1_61, v[60], w[60], x[60]);
fa$ fa61 (s1_61, c1_62, v[61], w[61], x[61]);
fa$ fa62 (s1_62, c1_63, v[62], w[62], x[62]);
fa$ fa63 (s1_63, c1_64, v[63], w[63], x[63]);
fa$ fa64 (s1_64, c1_65, v[64], w[64], x[64]);
fa$ fa65 (s1_65, c1_66, v[65], w[65], x[65]);
fa$ fa66 (s1_66, c1_67, v[66], w[66], x[66]);
ha$ ha1_67 (s1_67, c1_68, w[67], x[67]);
ha$ ha1_68 (s1_68, c1_69, w[68], x[68]);
ha$ ha1_69 (s1_69, c1_70, w[69], x[69]);
ha$ ha1_70 (s1_70, c1_71, w[70], x[70]);
ha$ ha1_71 (s1_71, c1_72, w[71], x[71]);
ha$ ha1_72 (s1_72, c1_73, w[72], x[72]);
ha$ ha1_73 (s1_73, c1_74, w[73], x[73]);
ha$ ha1_74 (s1_74, c1_75, w[74], x[74]);

ha$ ha3 (prod[1], c2_2, s1_1, c1_1);
ha$ ha4 (s2_2, bop[3], s1_2, c1_2);
ha$ ha5 (aop[3], bop[4], s1_3, c1_3);
ha$ ha6 (aop[4], bop[5], s1_4, c1_4);
ha$ ha7 (aop[5], bop[6], s1_5, c1_5);
fa$ fa2_1 (aop[6], bop[7], y[6], s1_6, c1_6);
fa$ fa2_2 (aop[7], bop[8], y[7], s1_7, c1_7);
fa$ fa2_3 (aop[8], bop[9], y[8], s1_8, c1_8);
fa$ fa2_4 (s2_9, c2_10, y[9], s1_9, c1_9);
fa$ fa2_10 (s2_10, c2_11, y[10], s1_10, c1_10);
fa$ fa2_11 (s2_11, c2_12, y[11], s1_11, c1_11);
fa$ fa2_12 (s2_12, c2_13, y[12], s1_12, c1_12);
fa$ fa2_13 (s2_13, c2_14, y[13], s1_13, c1_13);
fa$ fa2_14 (s2_14, c2_15, y[14], s1_14, c1_14);
fa$ fa2_15 (s2_15, c2_16, y[15], s1_15, c1_15);
fa$ fa2_16 (s2_16, c2_17, y[16], s1_16, c1_16);
fa$ fa2_17 (s2_17, c2_18, y[17], s1_17, c1_17);
fa$ fa2_18 (s2_18, c2_19, y[18], s1_18, c1_18);
fa$ fa2_19 (s2_19, c2_20, y[19], s1_19, c1_19);
fa$ fa2_20 (s2_20, c2_21, y[20], s1_20, c1_20);
fa$ fa2_21 (s2_21, c2_22, y[21], s1_21, c1_21);
fa$ fa2_22 (s2_22, c2_23, y[22], s1_22, c1_22);
fa$ fa2_23 (s2_23, c2_24, y[23], s1_23, c1_23);
fa$ fa2_24 (s2_24, c2_25, y[24], s1_24, c1_24);
fa$ fa2_25 (s2_25, c2_26, y[25], s1_25, c1_25);
fa$ fa2_26 (s2_26, c2_27, y[26], s1_26, c1_26);
fa$ fa2_27 (s2_27, c2_28, y[27], s1_27, c1_27);
fa$ fa2_28 (s2_28, c2_29, y[28], s1_28, c1_28);
fa$ fa2_29 (s2_29, c2_30, y[29], s1_29, c1_29);
fa$ fa2_30 (s2_30, c2_31, y[30], s1_30, c1_30);
fa$ fa2_31 (s2_31, c2_32, y[31], s1_31, c1_31);
fa$ fa2_32 (s2_32, c2_33, y[32], s1_32, c1_32);
fa$ fa2_33 (s2_33, c2_34, y[33], s1_33, c1_33);
fa$ fa2_34 (s2_34, c2_35, y[34], s1_34, c1_34);
fa$ fa2_35 (s2_35, c2_36, y[35], s1_35, c1_35);
fa$ fa2_36 (s2_36, c2_37, y[36], s1_36, c1_36);
fa$ fa2_37 (s2_37, c2_38, y[37], s1_37, c1_37);
fa$ fa2_38 (s2_38, c2_39, y[38], s1_38, c1_38);
fa$ fa2_39 (s2_39, c2_40, y[39], s1_39, c1_39);
fa$ fa2_40 (s2_40, c2_41, y[40], s1_40, c1_40);
fa$ fa2_41 (s2_41, c2_42, y[41], s1_41, c1_41);
fa$ fa2_42 (s2_42, c2_43, y[42], s1_42, c1_42);
fa$ fa2_43 (s2_43, c2_44, y[43], s1_43, c1_43);
fa$ fa2_44 (s2_44, c2_45, y[44], s1_44, c1_44);
fa$ fa2_45 (s2_45, c2_46, y[45], s1_45, c1_45);
fa$ fa2_46 (s2_46, c2_47, y[46], s1_46, c1_46);
fa$ fa2_47 (s2_47, c2_48, y[47], s1_47, c1_47);
fa$ fa2_48 (s2_48, c2_49, y[48], s1_48, c1_48);
fa$ fa2_49 (s2_49, c2_50, y[49], s1_49, c1_49);
fa$ fa2_50 (s2_50, c2_51, y[50], s1_50, c1_50);
fa$ fa2_51 (s2_51, c2_52, y[51], s1_51, c1_51);
fa$ fa2_52 (s2_52, c2_53, y[52], s1_52, c1_52);
fa$ fa2_53 (s2_53, c2_54, y[53], s1_53, c1_53);
fa$ fa2_54 (s2_54, c2_55, y[54], s1_54, c1_54);
fa$ fa2_55 (s2_55, c2_56, y[55], s1_55, c1_55);
fa$ fa2_56 (s2_56, c2_57, y[56], s1_56, c1_56);
fa$ fa2_57 (s2_57, c2_58, y[57], s1_57, c1_57);
fa$ fa2_58 (s2_58, c2_59, y[58], s1_58, c1_58);
fa$ fa2_59 (s2_59, c2_60, y[59], s1_59, c1_59);
fa$ fa2_60 (s2_60, c2_61, y[60], s1_60, c1_60);
fa$ fa2_61 (s2_61, c2_62, y[61], s1_61, c1_61);
fa$ fa2_62 (s2_62, c2_63, y[62], s1_62, c1_62);
fa$ fa2_63 (s2_63, c2_64, y[63], s1_63, c1_63);
fa$ fa2_64 (s2_64, c2_65, y[64], s1_64, c1_64);
fa$ fa2_65 (s2_65, c2_66, y[65], s1_65, c1_65);
fa$ fa2_66 (s2_66, c2_67, y[66], s1_66, c1_66);
fa$ fa2_67 (s2_67, c2_68, y[67], s1_67, c1_67);
fa$ fa2_68 (s2_68, c2_69, y[68], s1_68, c1_68);
fa$ fa2_69 (s2_69, c2_70, y[69], s1_69, c1_69);
fa$ fa2_70 (s2_70, c2_71, y[70], s1_70, c1_70);
fa$ fa2_71 (s2_71, c2_72, y[71], s1_71, c1_71);
fa$ fa2_72 (s2_72, c2_73, y[72], s1_72, c1_72);
fa$ fa2_73 (s2_73, c2_74, y[73], s1_73, c1_73);
fa$ fa2_74 (s2_74, c2_75, y[74], s1_74, c1_74);

ha$ ha10 (prod[2], cin, s2_2, c2_2);
ha$ ha11 (aop[9], bop[10], z[9], s2_9);
fa$ fa3_10 (aop[10], bop[11], z[10], s2_10, c2_10);
fa$ fa3_11 (aop[11], bop[12], z[11], s2_11, c2_11);
fa$ fa3_12 (aop[12], bop[13], z[12], s2_12, c2_12);
fa$ fa3_13 (aop[13], bop[14], z[13], s2_13, c2_13);
fa$ fa3_14 (aop[14], bop[15], z[14], s2_14, c2_14);
fa$ fa3_15 (aop[15], bop[16], z[15], s2_15, c2_15);
fa$ fa3_16 (aop[16], bop[17], z[16], s2_16, c2_16);
fa$ fa3_17 (aop[17], bop[18], z[17], s2_17, c2_17);
fa$ fa3_18 (aop[18], bop[19], z[18], s2_18, c2_18);
fa$ fa3_19 (aop[19], bop[20], z[19], s2_19, c2_19);
fa$ fa3_20 (aop[20], bop[21], z[20], s2_20, c2_20);
fa$ fa3_21 (aop[21], bop[22], z[21], s2_21, c2_21);
fa$ fa3_22 (aop[22], bop[23], z[22], s2_22, c2_22);
fa$ fa3_23 (aop[23], bop[24], z[23], s2_23, c2_23);
fa$ fa3_24 (aop[24], bop[25], z[24], s2_24, c2_24);
fa$ fa3_25 (aop[25], bop[26], z[25], s2_25, c2_25);
fa$ fa3_26 (aop[26], bop[27], z[26], s2_26, c2_26);
fa$ fa3_27 (aop[27], bop[28], z[27], s2_27, c2_27);
fa$ fa3_28 (aop[28], bop[29], z[28], s2_28, c2_28);
fa$ fa3_29 (aop[29], bop[30], z[29], s2_29, c2_29);
fa$ fa3_30 (aop[30], bop[31], z[30], s2_30, c2_30);
fa$ fa3_31 (aop[31], bop[32], z[31], s2_31, c2_31);
fa$ fa3_32 (aop[32], bop[33], z[32], s2_32, c2_32);
fa$ fa3_33 (aop[33], bop[34], z[33], s2_33, c2_33);
fa$ fa3_34 (aop[34], bop[35], z[34], s2_34, c2_34);
fa$ fa3_35 (aop[35], bop[36], z[35], s2_35, c2_35);
fa$ fa3_36 (aop[36], bop[37], z[36], s2_36, c2_36);
fa$ fa3_37 (aop[37], bop[38], z[37], s2_37, c2_37);
fa$ fa3_38 (aop[38], bop[39], z[38], s2_38, c2_38);
fa$ fa3_39 (aop[39], bop[40], z[39], s2_39, c2_39);
fa$ fa3_40 (aop[40], bop[41], z[40], s2_40, c2_40);
fa$ fa3_41 (aop[41], bop[42], z[41], s2_41, c2_41);
fa$ fa3_42 (aop[42], bop[43], z[42], s2_42, c2_42);
fa$ fa3_43 (aop[43], bop[44], z[43], s2_43, c2_43);
fa$ fa3_44 (aop[44], bop[45], z[44], s2_44, c2_44);
fa$ fa3_45 (aop[45], bop[46], z[45], s2_45, c2_45);
fa$ fa3_46 (aop[46], bop[47], z[46], s2_46, c2_46);
fa$ fa3_47 (aop[47], bop[48], z[47], s2_47, c2_47);
fa$ fa3_48 (aop[48], bop[49], z[48], s2_48, c2_48);
fa$ fa3_49 (aop[49], bop[50], z[49], s2_49, c2_49);
fa$ fa3_50 (aop[50], bop[51], z[50], s2_50, c2_50);
fa$ fa3_51 (aop[51], bop[52], z[51], s2_51, c2_51);
fa$ fa3_52 (aop[52], bop[53], z[52], s2_52, c2_52);
fa$ fa3_53 (aop[53], bop[54], z[53], s2_53, c2_53);
fa$ fa3_54 (aop[54], bop[55], z[54], s2_54, c2_54);
fa$ fa3_55 (aop[55], bop[56], z[55], s2_55, c2_55);
fa$ fa3_56 (aop[56], bop[57], z[56], s2_56, c2_56);
fa$ fa3_57 (aop[57], bop[58], z[57], s2_57, c2_57);
fa$ fa3_58 (aop[58], bop[59], z[58], s2_58, c2_58);
fa$ fa3_59 (aop[59], bop[60], z[59], s2_59, c2_59);
fa$ fa3_60 (aop[60], bop[61], z[60], s2_60, c2_60);
fa$ fa3_61 (aop[61], bop[62], z[61], s2_61, c2_61);
fa$ fa3_62 (aop[62], bop[63], z[62], s2_62, c2_62);
fa$ fa3_63 (aop[63], bop[64], z[63], s2_63, c2_63);
fa$ fa3_64 (aop[64], bop[65], z[64], s2_64, c2_64);
fa$ fa3_65 (aop[65], bop[66], z[65], s2_65, c2_65);
fa$ fa3_66 (aop[66], bop[67], z[66], s2_66, c2_66);
fa$ fa3_67 (aop[67], bop[68], z[67], s2_67, c2_67);
fa$ fa3_68 (aop[68], bop[69], z[68], s2_68, c2_68);
fa$ fa3_69 (aop[69], bop[70], z[69], s2_69, c2_69);
fa$ fa3_70 (aop[70], bop[71], z[70], s2_70, c2_70);
fa$ fa3_71 (aop[71], bop[72], z[71], s2_71, c2_71);
fa$ fa3_72 (aop[72], bop[73], z[72], s2_72, c2_72);
fa$ fa3_73 (aop[73], bop[74], z[73], s2_73, c2_73);
fa$ fa3_74 (aop[74], bop[75], z[74], s2_74, c2_74);

adder73b$ adder0 (prod[75:3], cout, {z[75], aop[74:3]}, {bop[75:3]}, cin);
endmodule

module pareg1b$(out, pin, sin, pen, minitb, clk);
input pin, sin, pen, minitb, clk;
output out;
wire muxout;
wire set1=1'b1;

mux2$ mux0 (muxout, sin, pin, pen);
samp_hold$ sh (out, muxout, set1, clk);

endmodule

module mul_radix8$(paout, multiplier, multiplicand, minit, clk);
input clk;
input minit;
input[63:0] multiplier;
input[63:0] multiplicand;
output [140:0] paout;

wire [75:0] prod, ppv;
wire [669:0] ppout;
wire [75:0] ppw, ppx, ppy, ppz;
wire [66:0] ppw_temp, ppx_temp, ppy_temp, ppz_temp;
wire pen_a;
wire pen_prod;
wire cin;
wire set0=1'b0;
wire [127:0] product;
wire [17:0] mplier_w, mplier_x, mplier_y, mplier_z;
wire [15:0] mplier_w_in, mplier_x_in, mplier_y_in, mplier_z_in;
wire cout;

assign ppv = {11'b0, paout[140:76]};
assign product = paout[139:12];
assign cout = paout[140];
assign mplier_w_in ={multiplier[54:52], multiplier[42:40],
                     multiplier[30:28], multiplier[18:16],
                     multiplier[6:4], multiplier[0]};
assign mplier_x_in ={multiplier[57:55], multiplier[45:43],
                     multiplier[33:31], multiplier[21:19],
                     multiplier[9:7], multiplier[1]};
assign mplier_y_in ={multiplier[60:58], multiplier[48:46],
                     multiplier[36:34], multiplier[24:22],
                     multiplier[12:10], multiplier[2]};
assign mplier_z_in ={multiplier[63:61], multiplier[51:49],
                     multiplier[39:37], multiplier[27:25],
                     multiplier[15:13], multiplier[3]};

seqcon_radix8$ seqcon (pen_a, pen_prod, iclk2, iclk3, iclk4, iclk5,
                       minit, multiplier[3:1], clk);
pareg128b$ pareg (paout, prod, multiplier, c76, pen_a, pen_prod, clk);
ppreg$ ppreg (ppout, multiplicand, iclk2);
mplier_shft16b$ mplier0 (mplier_w, mplier_w_in, pen_a, clk);
mplier_shft16b$ mplier1 (mplier_x, mplier_x_in, pen_a, clk);
mplier_shft16b$ mplier2 (mplier_y, mplier_y_in, pen_a, clk);
mplier_shft16b$ mplier3 (mplier_z, mplier_z_in, pen_a, clk);
mux8_67$ mux0 (ppw_temp, 67'b0, ppout[133:67], ppout[200:134],
               ppout[267:201], ppout[334:268], ppout[401:335],
               ppout[468:402], ppout[535:469], mplier_w[2:0]);
mux8_67$ mux1 (ppx_temp, 67'b0, ppout[133:67], ppout[200:134],
               ppout[267:201], ppout[334:268], ppout[401:335],
               ppout[468:402], ppout[535:469], mplier_x[2:0]);
mux8_67$ mux2 (ppy_temp, 67'b0, ppout[133:67], ppout[200:134],
               ppout[267:201], ppout[334:268], ppout[401:335],
               ppout[468:402], ppout[535:469], mplier_y[2:0]);
mux8_67$ mux3 (ppz_temp, 67'b0, ppout[133:67], ppout[200:134],
               ppout[267:201], ppout[334:268], ppout[401:335],
               ppout[468:402], ppout[535:469], mplier_z[2:0]);
mux2_75$ mux4 (ppw, {9'b0, ppw_temp}, {3'b0, ppw_temp, 6'b0}, iclk2);
mux2_75$ mux5 (ppx, {6'b0, ppx_temp, 3'b0}, {2'b0, ppx_temp, 7'b0}, iclk2);
mux2_75$ mux6 (ppy, {3'b0, ppy_temp, 6'b0}, {1'b0, ppy_temp, 8'b0}, iclk2);
csa_5op$ csa0 (c76, prod, ppv, ppw, ppx, ppy, {ppz_temp,9'b0});
endmodule


Bibliography

[1] Jonathan P. Bowen, He Jifeng, and Xu Qiwen. An animatable operational semantics of the Verilog Hardware Description Language. In John A. McDermid Shaoying Liu and Michael G. Hinchey, editors, Proc. ICFEM 2000: 3rd IEEE International Conference on Formal Engineering Methods, pages 199–207. IEEE Computer Society Press, 2000. [2] Robert S. Boyer and J. Strother Moore. Program verification. Journal of Automated Reasoning, 1(1):17–23, 1985. [3] R. E. Bryant. On the complexity of vlsi implementations and graph representations of boolean functions with application to integer multiplication. IEEE Transactions on Computers, 40(2):205–213, 1991. [4] R. E. Bryant and Yirng-An Chen. Verification of arithmetic circuits with

binary moment diagrams. In Design Automation Conference, pages 535–541, 1995. [5] J. R. Burch. Using bdds to verify multipliers. In Proceedings of the 28th conference on ACM/IEEE design automation conference, pages 408–412. ACM Press, 1991. [6] E. M. Clarke, M. Fujita, and X. Zhao. Hybrid decision diagrams. In Proceedings of the 1995 IEEE/ACM international conference on Computer-aided 106

design, pages 159–163, 1995. [7] D. Cyrluk. Microprocessor Verification in PVS: A Methodology and Simple Example. Technical Report SRI-CSL-93-12, Menlo Park, CA, 1993. [8] D. Kapur and M. Subramaniam. Mechanically verifying a family of multiplier circuits. In Rajeev Alur and Thomas A. Henzinger, editors, Proceedings

of the Eighth International Conference on Computer Aided Verification CAV, volume 1102, pages 135–146, New Brunswick, NJ, USA, / 1996. Springer Verlag. [9] D. M. Russinoff. A Mechanically Checked Proof of IEEE Compliance of a Register-Transfer-Level Specification of the AMD-K7 Floating-Point Multiplication, Division, and Square Root Instructions. In LMS Journal of Computation and Mathematics, volume 1, pages 148–200, December 1998. [10] Nachum Dershowitz. A taste of rewrite systems. In Functional Programming, Concurrency, Simulation and Automated Reasoning, pages 199–228, 1993. [11] VHDL Synthesis Interoperability Working Group. Ieee p1076.6/d2.01 draft standard for vhdl register transfer level synthesis. [12] H. Anderson, P. Williams, and H. Hulgaard. Equivalence Checking of Combinational Circuits using Boolean Expression Diagrams. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(7), 1999.

107

[13] J. Hoe and Arvind. Hardware synthesis from term rewriting systems. In X IFIP International Conference on VLSI (VLSI 99), Lisbon, Portugal, November 1999. [14] W. A. Hunt. FM8501: A Verified Microprocessor. PhD thesis, University of Texas at Austin, 1985. [15] Texas Instruments. SN74181, SN74LS181, SN74S181 Arithmetic Logic

Units/ Function Generators. Bulletin No. DL-S 7611831, December 1972. [16] J. Field. A simple rewriting semantics for realistic imperative programs and its application to program analysis. In ACM SIGPLAN Workshop on Par-

tial Evaluation and Semantics-Based Program Manipulation, pages 98–107, 1990. [17] D. Kapur. Theorem proving support for hardware verification. In Third

Intl. Workshop on First-Order Theorem Proving (FTP 2000), St. Andrews, Scotland, July 2000. [18] D. Kapur and H. Zhang. An overview of Rewrite Rule Laboratory (RRL). J. Computer and Mathematics with Applications, 29(2):91–114, 1995. [19] M. Kaufmann and J. Moore. ACL2: An industrial strength version of nqthm. In Compass’96: Eleventh Annual Conference on Computer Assurance, page 23, Gaithersburg, Maryland, 1996. National Institute of Standards and Technology. [20] M. Kaufmann. Personal Communication. 108

[21] J. Klop. Term Rewriting Systems. In In S. Abramsky, D. M. Gabbay, and T. S. E. Maibaum, editors: Handbook of Logik in Computer Science, Oxford University Press, volume 2, pages 1–116, 1992. [22] M. Kaufmann and D. Russinoff. Verification of Pipeline Circuits. In ACL2 Workshop 2000 (proceedings are available as UTCS Technical Report TR-0029), October 2000. [23] Z. Manna, N. Bjorner, A. Browne, E. Y. Chang, M. Colon, L. de Alfaro, H. Devarajan, A. Kapur, J. Lee, H. Sipma, and T. E. Uribe. Step: The stanford temporal prover. In TAPSOFT, pages 793–794, 1995. [24] IEEE 1364-2001 Standard Verilog Language Reference Manual. [25] Y. Matsunaga. An Efficient Equivalence Checker for Combinational Circuits. In Proceedings of Design Automation Conference, pages 629–634, 1996. [26] J. Sawada and W. A. Hunt. Processor verification with precise exceptions and speculative execution. In Proc. 10th International Computer Aided Verification Conference, pages 135–146, 1998. [27] X. Shen. Design and Verification of Speculative Processors. In Proceedings of the Workshop on Formal Techniques for Hardware and Hardware-like Systems, Marstrand, Sweden, June 1998. [28] J. Strother Moore. Personal Communication.

109

[29] Li Yongjian and He Jifeng. Towards a theory of bisimulation for a fragment of verilog. In International Parallel and Distributed Processing Symposium (IPDPS’03), pages 22 – 26, April 2003. [30] H. Yu and J. A. Abraham. An Efficient 3-bit-scan Multiplier without Overlappong Bits, and its 64X64 Bit Implementation. In Proceedings of 7th Asia and South Pacific Design Automation Conference, January 2002. [31] Z. Zhou, X. Song, F. Corella, E. Cerny, and M. Langevin. Description and verification of RTL designs using multiway decision graphs. In Proceedings of the Conference on Hardware Description Languages, 1995. [32] Zhu Huibiao, Jonathan P. Bowen, and He Jifeng. Soundness, completeness and non-redundancy of operational semantics for Verilog based on denotational semantics. In Chris George and Huaikou Miao, editors, Formal Methods and Software Engineering, ICFEM 2002: 4th International Conference on Formal Engineering Methods, volume 2495 of Lecture Notes in Computer Science, pages 600–612. Springer-Verlag, 21–25 October 2002. Extended version to be available as Technical Report SBU-CISM-02-07, SCISM, South Bank University, London, UK, 2002.

110

Vita

Shobha Vasudevan received her Bachelor's degree in Computer Engineering from the University of Mumbai, India. She is currently in the PhD program with Dr. Jacob Abraham. Her interests are formal verification of RT-Level designs, verification of C-level specifications, software verification techniques and their application to hardware.

Permanent address: 2, YASHODAN, Dinshaw Waccha Road, Mumbai-400020. INDIA.

This thesis was typeset with LaTeX† by the author.

†LaTeX is a document preparation system developed by Leslie Lamport as a special version of Donald Knuth's TeX Program.
