# Sense Amplifier Half-Buffer (SAHB): A Low-Power High-Performance Asynchronous Logic QDI Cell Template

Kwen-Siong Chong, Senior Member, IEEE, Weng-Geng Ho, Student Member, IEEE, Tong Lin, Member, IEEE, Bah-Hwee Gwee, Senior Member, IEEE, and Joseph S. Chang, Senior Member, IEEE

Abstract: We propose a novel asynchronous logic (async) quasi-delay-insensitive (QDI) sense-amplifier half-buffer (SAHB) cell design approach, with emphases on high operational robustness, high speed, and low power dissipation. There are five key features of our proposed SAHB. First, the SAHB cell embodies the async QDI 4-phase $(4\phi)$ signaling protocol to accommodate process-voltage-temperature variations. Second, the sense amplifier (SA) block in SAHB cells embodies a cross-coupled latch with a positive feedback mechanism to speed up the output evaluation. Third, the evaluation block in the SAHB comprises both nMOS pull-up and pull-down networks with minimum transistor sizing to reduce the parasitic capacitance. Fourth, both the evaluation block and SA block are tightly coupled to reduce redundant internal switching nodes. Fifth, the SAHB cell is designed in CMOS static logic and is hence appropriate for full-range dynamic voltage scaling operation for $V_{\rm DD}$ ranging from nominal voltage (1 V) to subthreshold voltage ($\sim$0.3 V). When six library cells embodying our proposed SAHB are compared with those embodying the conventional async QDI precharged half-buffer (PCHB) approach, the proposed SAHB cells simultaneously feature ~64% lower power, ~21% higher speed, and ~6% smaller IC area; the PCHB cell is inappropriate for subthreshold operation. A prototype 64-bit Kogge-Stone pipeline adder based on the SAHB approach (at 65 nm CMOS) is designed. For a 1-GHz throughput and at nominal $V_{\rm DD}$, the design based on the SAHB approach simultaneously features ~56% lower energy and ~24% lower transistor count than its PCHB counterpart. When benchmarked against the ubiquitous synchronous logic counterpart, our SAHB dissipates ~39% lower energy at the 1-GHz throughput.

Manuscript received September 16, 2015; revised February 24, 2016 and April 29, 2016; accepted June 7, 2016. This work was supported by the Agency for Science, Technology and Research, Singapore, within the Science and Engineering Research Council 2013 through the Public Sector Research Funding under Grant SERC1321202098. (Corresponding author: Weng-Geng Ho.) K.-S. Chong, W.-G. Ho, and T. Lin are with Temasek Laboratories, Nanyang Technological University, Singapore 637553 (e-mail: kschong@ntu.edu.sg; wgho@ntu.edu.sg; lintong@ntu.edu.sg). B.-H. Gwee and J. S. Chang are with the Virtus, IC Design Centre of Excellence, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798 (e-mail: ebhgwee@ntu.edu.sg; ejschang@ntu.edu.sg). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2016.2583118

## I. INTRODUCTION

With the ever-increasing complexity of system-on-chip (SoC) [1], it is likewise increasingly challenging for designers to simultaneously meet the many circuit/design constraints, including power dissipation, throughput, reliability, etc. [2]. To design an SoC (and its associated speed-/power-aware circuitry) [3], the exploitation of asynchronous logic (async) [4], as opposed to fully synchronous logic (sync) or the partial use of globally asynchronous locally synchronous logic [5], [40], is increasingly popular due to its self-timed nature, potentially resulting in lower power dissipation (e.g., low idle power dissipation) and more robust synchronization [more timing tolerance to process-voltage-temperature (PVT) variations] [6].
Not unexpectedly, the International Technology Roadmap for Semiconductors [7] projects that the adoption of async circuits in SoCs will increase from the current $\sim$22% to $\sim$54% in 2026.

Fig. 1 broadly classifies digital logic for the realization of operationally robust digital circuits. In the highest classification, there are the sync and async digital logic design philosophies. As the sync digital logic design philosophy requires timing assumptions associated with the clock (e.g., clock skews and setup/hold times) [8], realizing operationally robust circuits under large PVT variations is challenging because large timing margins are required to accommodate the worst case conditions. In contrast, the async digital logic design philosophy, particularly the quasi-delay-insensitive (QDI) approach, is an alternative that mitigates the timing assumptions. There are nevertheless other challenges, which are discussed in the following two paragraphs.

Fig. 1. General classification of digital logic circuits.

In Fig. 1, the classifications within the async digital logic design philosophy are depicted. From the perspective of the timing approach, there are three async types [38]: 1) delay-insensitive (DI); 2) bundled-data (BD); and 3) QDI/timed-pipeline (TP)/single-track (ST). The first type, DI circuits, is largely impractical because DI circuits make no assumption on the gate/wire delays, leading to circuit realizations comprising only buffer cells and C-Muller cells [9]. The second type, BD circuits, is similar to sync circuits in requiring delay assumptions for circuit realization. As their operation relies on bounded gate/wire delays, it is somewhat challenging to guarantee their operational robustness in unknown operating conditions. The third type groups QDI, TP, and ST circuits together for their similar completion detection mechanisms [10], [41].
QDI circuits operate error free for arbitrary wire delays and assume isochronic forks [11], i.e., the same wire delays are assumed for different branches. This assumption can be satisfied easily in the placement and routing stage. On the other hand, although TP circuits and ST circuits have completion detection mechanisms, they require delay assumptions for their circuit realizations. These delay assumptions consequently reduce the reliability of their circuits for unknown operating conditions. In short, as the QDI async approach detects the completion of data according to actual workloads and/or operating conditions, it offers the most practical approach [12]–[15] to accommodate unknown PVT variations. Within the QDI/TP/ST approaches, there are two general async pipeline structures: 1) data control decomposition (DCD) and 2) integrated-latch (IL) pipelines [21], [22]. The DCD pipeline separates async controllers and datapaths where they can be designed individually, i.e., largely independently. Such pipelines are simple but less speed efficient [23] as the bulk of cells would be grouped together to form a block-level pipeline stage, hence resulting in a longer critical path. The QDI cell design approaches with the DCD pipeline include delay-insensitive minterm synthesis [12], null convention logic [13], and precharged static logic [14]. On the other hand, the IL pipeline is more speed efficient since it incorporates a pair of async controller and logic cell to form a microcell pipeline stage, hence resulting in a shorter critical path. The QDI cell design approaches with the IL pipeline, as depicted in the last row of Fig. 1, include our sense amplifier half-buffer (SAHB) and the reported precharged half-buffer (PCHB) [15]. The SAHB is the proposed work in this paper and will be described later in Section II. 
For completeness, there are other async cell design approaches, including PS0 [16], LP2/1 [17], single-track asynchronous pulse logic (STAPL) [18], single-track full-buffer (STFB) [19], and sense-amplifier-based pass transistor logic (SAPTL) [20]. As these reported approaches require delay assumptions, their operational robustness is compromised; they are classified as either TP or ST [37], [38]. Their async operation modality and the delay assumptions of their cell templates have been reported [39].

In this paper, we propose the SAHB, a novel QDI cell design approach, with emphases on high operational robustness, high speed, and low energy dissipation. There are several noteworthy novel features in the SAHB. First, the SAHB cell incorporates an evaluation block and a sense amplifier (SA) block [24] to perform an async 4-phase QDI operation [11], thereby accommodating timing uncertainty in the presence of PVT variations, including unknown variations. Second, the SA block embodies a cross-coupled latch with a positive feedback mechanism to speed up and latch the output. Third, both the pull-up and pull-down networks in the evaluation block comprise only nMOS (instead of CMOS) transistors to reduce the parasitic capacitance and hence the ensuing power dissipation. Fourth, the evaluation block and SA block are tightly coupled to reduce the number of switching nodes, resulting in short cycle time and low power dissipation. Fifth, the SAHB cell is realized in static logic style and is hence appropriate for full-range dynamic voltage scaling (FDVS) for supply voltages ranging from nominal voltage (1 V) to deep subthreshold (0.3 V) [25].
On the basis of six library cells (i.e., buffer, two-input AND/NAND, two-input OR/NOR, two-input XOR/XNOR, two-input MUX/IMUX, and three-input AO/AOI) at 1 V in a 65-nm CMOS process, the proposed SAHB approach outperforms the reported competing PCHB approach; the PCHB library cells, on average, dissipate $2.8\times$ more power, suffer $1.27\times$ longer delay, and occupy $1.06\times$ larger area than the SAHB library cells.

In this paper, we further describe a 64-bit Kogge-Stone (KS) pipeline adder embodying the proposed SAHB approach for a power management application. Our SAHB pipeline adder is experimentally verified to be operationally robust within a wide supply voltage range (0.3 to 1.4 V) and a wide temperature range (-40 °C to 100 °C). When benchmarked against its competing async PCHB and sync equivalents (at 1-GHz throughput), our SAHB pipeline adder is more energy efficient; the PCHB and the sync counterparts dissipate 2.29× and 1.65× higher energy, respectively. The PCHB pipeline adder further suffers from 1.31× higher transistor count, translating to an estimated 7% larger IC area.

This paper is organized as follows. Section II presents the proposed SAHB cell design approach, and its attributes are benchmarked against reported competing async cell design approaches. Section III describes the 64-bit SAHB pipeline adder, and its attributes are benchmarked against the async PCHB and the sync counterparts. Finally, conclusions are drawn in Section IV.

## II. PROPOSED SENSE AMPLIFIER HALF-BUFFER

#### A. Sense Amplifier Half-Buffer

Fig. 2. SAHB cell template.

Fig. 2 depicts the generic interface signals for the proposed dual-rail SAHB cell template. The data inputs are $Data_{\rm in}$ and $nData_{\rm in}$, and the data outputs are Q.T/Q.F and nQ.T/nQ.F. The left-channel handshake outputs are $L_{\rm ack}$ and $nL_{\rm ack}$, and the right-channel handshake inputs are $R_{\rm ack}$ and $nR_{\rm ack}$.
$nData_{\rm in}$, nQ.T, nQ.F, $nL_{\rm ack}$, and $nR_{\rm ack}$ are the logical complements of the primary input/output signals $Data_{\rm in}$, Q.T, Q.F, $L_{\rm ack}$, and $R_{\rm ack}$, respectively. For the sake of brevity, we will only use the primary input/output signals to delineate the operations of an SAHB cell.

The SAHB cell strictly abides by the async 4-phase $(4\phi)$ handshake protocol, which has two alternating operation sequences: evaluation and reset. Initially, $L_{\rm ack}$ and $R_{\rm ack}$ are reset to 0 and both $Data_{\rm in}$ and Q.T/Q.F are empty, i.e., both of the rails in each signal are 0. During the evaluation sequence, when $Data_{\rm in}$ is valid (i.e., one of the rails in each signal is 1) and $R_{\rm ack}$ is 0, Q.T/Q.F is evaluated and latched, and $L_{\rm ack}$ is asserted to 1 to indicate the validity of the output. During the reset sequence, when $Data_{\rm in}$ is empty and $R_{\rm ack}$ is 1, Q.T/Q.F becomes empty and $L_{\rm ack}$ is deasserted to 0. Subsequently, the SAHB cell is ready for the next operation.

For illustration, Fig. 3(a) and (b) depicts the respective circuit schematics of the evaluation block and the SA block of a buffer cell embodying SAHB; the various sub-blocks are shown within the dotted blocks. The evaluation and SA blocks are powered, respectively, by $V_{\rm DD\_L}$ and by $V_{\rm DD}$, which can be the same or different voltages (see Section II-B). The nMOS transistor in green with *RST* is optional for cell initialization.

In Fig. 3(a), the evaluation block comprises an nMOS pull-up network and an nMOS pull-down network to, respectively, evaluate and reset the dual-rail output Q.T/Q.F. Of particular interest, the nMOS pull-up network features low parasitic capacitance (lower than that of the usual pMOS pull-up network, whose transistor sizing is often $2\times$ larger than that of the nMOS). From a structural view, the ensuing signals function as follows.
Consider first the nMOS pull-up network, where Q.T/Q.F is evaluated based on the data input (i.e., A.T/A.F) and $nR_{\rm ack}$ serves as the evaluation flow-control signal. The nMOS pull-up network realizes the buffer logic function expressed in (1). To reduce the short-circuit current, nQ.T/nQ.F disconnects the evaluation function once Q.T/Q.F is evaluated

$$Q.T = A.T; \quad Q.F = A.F. \tag{1}$$

Consider now the nMOS pull-down network, where Q.T/Q.F is reset depending on the data input. For the single-input buffer cell depicted in Fig. 3, the transistor configuration of A.T/A.F in the pull-up network forms a series-parallel topology with the transistor configuration of nA.T/nA.F in the pull-down network. As examples of two-input and three-input cells, Fig. 4 depicts two-input AND/NAND, two-input XOR/XNOR, and three-input AO/AOI cells. Their series-parallel pairs are marked with \* and # for the Q.T path and Q.F path, respectively. In Figs. 3 and 4, $R_{\rm ack}$ serves as the reset flow-control signal, connected in series with the data input (transistors marked with $\land$) for input completeness [37].

Fig. 3. Circuit schematic of a buffer cell embodying SAHB. (a) Evaluation block powered by $V_{\rm DD\_L}$. (b) SA block powered by $V_{\rm DD}$.

Fig. 4. Dual-rail SAHB library cells. (a) Two-input AND/NAND. (b) Two-input XOR/XNOR. (c) Three-input AO/AOI.

In Fig. 3(b), the SA block comprises an SA cross-coupled latch, complementary buffers, and a completion circuit. The SA cross-coupled latch amplifies and latches Q.T/Q.F. The complementary buffers and completion circuit, respectively, generate the complementary output signals (nQ.T/nQ.F) and the left-channel handshake signals ($L_{\rm ack}$/$nL_{\rm ack}$). From a structural view, the cross-coupled inverters serve as an amplifier: in the reset phase, both Q.T and Q.F are 0 and $V_{\rm DD\_V}$ is floating.
During the evaluation phase, Q.T and Q.F develop a small voltage difference, and when $V_{\rm DD\_V}$ is connected to $V_{\rm DD}$, the cross-coupled inverters amplify (through positive feedback) the voltage difference between Q.T and Q.F. To realize the input completeness feature, the top-left branch in the SA cross-coupled latch detects whether all inputs (i.e., nA.T and nA.F) are ready and $R_{\rm ack}=0$. For bistable operation, the top-right branch (within the dotted oval) in the SA cross-coupled latch holds the output until all inputs are empty and $R_{\rm ack}=1$.

Initially, A.T, A.F, $R_{\rm ack}$, and $L_{\rm ack}$ are 0 and nA.T, nA.F, $nR_{\rm ack}$, and $nL_{\rm ack}$ are 1. During the evaluation phase, for example, when A.F = 1 (nA.F = 0), the node Q.F is partially charged toward $V_{\rm DD\_L}$ by the nMOS pull-up network in the evaluation block and Q.T remains at 0 (via the nMOS pull-down network). As the input is now valid, the SA cross-coupled latch is turned ON by connecting the virtual supply $V_{\rm DD\_V}$ to $V_{\rm DD}$, and it amplifies Q.F to 1. Q.F is thereafter latched (together with the pMOS bistable transistors and the cross-coupled inverters) and nQ.F becomes 0 (to disconnect the node Q.F from $V_{\rm DD\_L}$ in the evaluation block and so prevent any short-circuit current). $L_{\rm ack}$ is asserted to 1 ($nL_{\rm ack}=0$) to indicate the validity of the dual-rail output. During the reset phase, when the input is empty (nA.T and nA.F are 1) and $R_{\rm ack}=1$, the dual-rail output becomes empty and $L_{\rm ack}$ is deasserted to 0. At this juncture, the SA block is ready for a new operation.

Note that the evaluation block and SA block are tightly coupled to reduce the number of switching nodes, thereby enhancing the speed and reducing the power dissipation. Furthermore, as both the evaluation and SA blocks operate in static logic style, their transistor sizings are not critical.
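The evaluation and reset sequences described above can be summarized in a small behavioural model. The following Python sketch is an illustration only, not the authors' circuit: it works at the signal level, ignores all analog behaviour of the SA latch, and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SahbBufferModel:
    """Signal-level model of a dual-rail SAHB buffer (hypothetical helper)."""
    q_t: int = 0    # true rail of the output
    q_f: int = 0    # false rail of the output
    l_ack: int = 0  # left-channel acknowledge

    def step(self, a_t: int, a_f: int, r_ack: int) -> None:
        # A dual-rail code word never has both rails at 1.
        assert not (a_t and a_f), "both rails at 1 is not a legal code word"
        valid = (a_t ^ a_f) == 1
        empty = a_t == 0 and a_f == 0
        if valid and r_ack == 0:
            # Evaluation: Q.T = A.T, Q.F = A.F, then L_ack is asserted.
            self.q_t, self.q_f, self.l_ack = a_t, a_f, 1
        elif empty and r_ack == 1:
            # Reset: the output returns to empty and L_ack is deasserted.
            self.q_t, self.q_f, self.l_ack = 0, 0, 0
        # Otherwise the latch simply holds its previous state.


# One full 4-phase cycle transferring a logic 0 (A.F = 1).
cell = SahbBufferModel()
trace = []
for a_t, a_f, r_ack in [(0, 1, 0), (0, 1, 1), (0, 0, 1), (0, 0, 0)]:
    cell.step(a_t, a_f, r_ack)
    trace.append((cell.q_t, cell.q_f, cell.l_ack))
```

In this trace the output is evaluated in the first step, held while the input is still valid and $R_{\rm ack}$ has risen, and released only once the input is empty and $R_{\rm ack}=1$, which mirrors the hold behaviour attributed to the bistable branch.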
The SAHB cell strictly abides by the QDI protocols: it is gate-orphan-free under the isochronic fork assumption [9] and it is input complete [11]. Consider first the timing constraints of gate-orphan freedom in the evaluation block and SA block. As the evaluation block is acknowledged by the input signals A.T and A.F (with input completeness) and the handshake signal $R_{\rm ack}$ (and $nR_{\rm ack}$), it is gate-orphan-free. In addition, as the SA block is only triggered when all the inputs are ready (i.e., input completeness), it is also gate-orphan-free. Second, consider the input completeness constraints, which involve the evaluation phase and the reset phase. The evaluation phase of the SA block commences only when all the inputs are valid. The reset phase of the pull-down network in the evaluation block (and subsequently the SA block) commences only when all the inputs are empty. The SAHB cell is hence input complete.

Fig. 4(a)-(c) depicts the circuit schematics of three basic SAHB library cells: 1) two-input AND/NAND; 2) two-input XOR/XNOR; and 3) three-input AO/AOI cells. The logic functions of the pull-up network for the AND/NAND, XOR/XNOR, and AO/AOI cells are, respectively, expressed in (2), (3), and (4). Similar to the buffer cell, the structure of the evaluation block and SA block of these cells is constructed based on their logic functions and input signals. These library cells will be used for benchmarking and for realizing the 64-bit SAHB pipeline adder (see Section III)

$$Q.T = A.T \cdot B.T; \quad Q.F = A.F + B.F \tag{2}$$

$$Q.T = A.T \cdot B.F + A.F \cdot B.T; \quad Q.F = A.T \cdot B.T + A.F \cdot B.F \tag{3}$$

$$Q.T = A.T \cdot B.T + C.T; \quad Q.F = (A.F + B.F) \cdot C.F. \tag{4}$$

#### B. Circuit Configuration and Supply Voltage Setup

In the evaluation block, there are two ways to configure the connection of the transistors for a multiple-input SAHB cell. Fig. 5(a) and (b) depicts two different circuit configurations for Q.F of the two-input AND/NAND SAHB cell.
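The two configurations of Fig. 5 differ in when the Q.F pull-up path is able to conduct. As a minimal sketch (not the authors' netlist), the Python model below treats each path as a Boolean function of the input rails. The function names are hypothetical, the gating by $nR_{\rm ack}$ and nQ.F is omitted, and the minterm form used for the unshared case is an assumption chosen so that the path conducts only for complete input code words.

```python
def qf_path_shared(a_t: int, a_f: int, b_t: int, b_f: int) -> bool:
    """Shared-transistor style: conducts as soon as either false rail is 1."""
    return bool(a_f or b_f)


def qf_path_unshared(a_t: int, a_f: int, b_t: int, b_f: int) -> bool:
    """Unshared style (assumed minterm form): conducts only when both
    inputs carry a valid code word and the AND result is 0."""
    return bool((a_f and b_f) or (a_f and b_t) or (a_t and b_f))


# Scenario of Fig. 5: only input A is valid (A = 0), input B is still empty.
only_a_valid = dict(a_t=0, a_f=1, b_t=0, b_f=0)
early_shared = qf_path_shared(**only_a_valid)      # path already conducts
early_unshared = qf_path_unshared(**only_a_valid)  # path stays off

# For complete dual-rail inputs the two styles agree.
valid_codes = [(1, 0), (0, 1)]
agree = all(
    qf_path_shared(at, af, bt, bf) == qf_path_unshared(at, af, bt, bf)
    for (at, af) in valid_codes
    for (bt, bf) in valid_codes
)
```

The early conduction of the shared style in the partially valid case is the reason a supply-voltage condition is needed for that configuration, whereas the unshared style trades extra transistors for freedom from that condition.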
Of these circuit configurations, the configuration in Fig. 5(a) is adopted in the cell library for its lower transistor count, where Q.F will be partially charged up toward $V_{\rm DD\_L}$ when either A.F or B.F is 1. The voltage levels of the supplies $V_{\rm DD\_L}$ and $V_{\rm DD}$ are critical to prevent an early output transition before all the inputs (A.F and B.F) are valid. The circuit configuration in Fig. 5(b) requires a higher transistor count, but it ensures that current I is conducted only when all the inputs are valid [37]. In this fashion, the voltage levels of the supplies $V_{\rm DD\_L}$ and $V_{\rm DD}$ are not critical and they can be connected to the same voltage source.

Fig. 5. Circuit configurations in a two-input SAHB AND/NAND cell. (a) Transistors are shared and (b) transistors are not shared. The drawings depict the scenario when only input A is valid.

We will now further elaborate on the voltage condition in SAHB cells using the circuit configuration shown in Fig. 5(a). Assuming that Q.F is initially reset at 0 V, we express in (5) the switching threshold voltage $V_X$ that causes the inverter (in the SA block) to switch

$$V_X \approx \frac{k \cdot V_{\rm DD}}{l+k} \tag{5}$$

where k is the pMOS-to-nMOS transistor width ratio and l is the electron-to-hole saturation mobility ratio. Assuming $nR_{\rm ack} = V_{\rm DD}$, $A.F = V_{\rm DD}$, and $nQ.F = V_{\rm DD}$, as depicted in Fig. 5(a), the voltage at Q.F ($V_{Q.F}$) can be expressed as

$$V_{Q.F} = \begin{cases} V_{\rm DD} - V_{\rm tn}, & \text{if } V_{\rm DD} \leq V_{\rm DD\_L} \\ V_{\rm DD\_L} - V_{\rm tn}, & \text{if } V_{\rm DD} \geq V_{\rm DD\_L} \ \&\ V_{\rm DD} - V_{\rm DD\_L} \leq V_{\rm tn} \\ V_{\rm DD\_L}, & \text{if } V_{\rm DD} \geq V_{\rm DD\_L} \ \&\ V_{\rm DD} - V_{\rm DD\_L} > V_{\rm tn}. \end{cases} \tag{6}$$

When input A is valid, the current I will charge Q.F despite the input B being empty. Hence, $V_{Q.F}$ must be smaller than $V_X$ in order to prevent the dual-rail output from being valid, as expressed in (7). Otherwise, the SAHB cell may operate too early, potentially violating the transition sequences with its neighboring SAHB cells

$$V_{Q.F} < V_X. \tag{7}$$

For $V_{\rm DD}=1$ V, $l\approx 3$, and a 65-nm CMOS process where $V_{\rm tn}=0.35$ V, we design the inverter with $k\approx 1.6$ [hence $V_X\approx 0.35$ V as ascertained from (5)] and set $V_{\rm DD\_L}\leq 0.3$ V, where $V_{Q.F}$ [$\leq 0.3$ V as ascertained from (6)] is lower than $V_X$, thus fulfilling the condition in (7). Since the evaluation block is not speed critical, a lower voltage for $V_{\rm DD\_L}$ will not compromise the overall speed but desirably reduces the leakage power dissipation.

It is interesting to note that PVT variations may cause an SAHB cell to operate too early if $V_{Q.F}$ is set too close to $V_X$. To ascertain this analytically, we formulate the voltage conditions in (8) by combining (6) and (7), where $V_{\rm DD}$, $V_{\rm tn}$, and l are subject to PVT variations. We can show that by setting $V_{\rm DD\_L} = 0.3$ V (0.05 V less than $V_X$), the ensuing conditions are less sensitive to PVT variations

$$\frac{V_{\rm DD}}{V_{\rm tn}} < \frac{l+k}{l}, \quad \text{if } V_{\rm DD} \leq V_{\rm DD\_L} \tag{8a}$$

$$V_{\rm DD\_L} - \frac{k}{l+k} \cdot V_{\rm DD} < V_{\rm tn}, \quad \text{if } V_{\rm DD} \geq V_{\rm DD\_L} \ \&\ V_{\rm DD} - V_{\rm DD\_L} \leq V_{\rm tn} \tag{8b}$$

$$\frac{V_{\rm DD\_L}}{V_{\rm DD}} < \frac{k}{l+k}, \quad \text{if } V_{\rm DD} \geq V_{\rm DD\_L} \ \&\ V_{\rm DD} - V_{\rm DD\_L} > V_{\rm tn}. \tag{8c}$$

#### C. Transistor Sizing Optimization and Layout

Fig. 6(a) depicts the general timing characteristics of an SAHB cell. As depicted in Fig. 6(a), the forward delay $t_F$ is the time from when $Data_{\rm in}$ becomes valid (with $R_{\rm ack}=0$) until $L_{\rm ack}$ is asserted (during the evaluation phase). The backward delay $t_B$ is the time from when $Data_{\rm in}$ becomes empty (with $R_{\rm ack}=1$) until $L_{\rm ack}$ is deasserted (during the reset phase). For completeness, note that $2t_F+2t_B$ constitutes the shortest delay of an SAHB cell [42].

With reference to the SAHB cell [see Fig. 3(a) and (b)], Fig. 6(b) depicts a possible critical path for $t_F$, from the assertion of Q.T until $L_{\rm ack}$ is asserted. For $t_F$, the sizing of the pMOS transistors along the critical path is important for high-speed operation. The critical path occurs within the SA block, i.e., the SA cross-coupled inverters will be switched ON only when all the inputs are valid for input completeness. On the other hand, Fig. 6(c) depicts a possible critical path for $t_B$, from the deassertion of Q.T until $L_{\rm ack}$ is deasserted. For $t_B$, the sizing of the nMOS transistors along the critical path is equally important for high-speed operation. The sizings of the critical pMOS and nMOS transistors are 410/60 nm and 270/60 nm, respectively. For simplicity, the sizings for these respective pMOS and nMOS transistors are henceforth denoted by $1\times$ transistor sizing.

Fig. 6. Timing characteristics for SAHB cell. (a) Timing diagram. (b) Possible critical path of $t_F$ in the SA block. (c) Possible critical path of $t_B$ in the evaluation block.

Fig. 7 depicts the delay, power dissipation, and diffusion area of an SAHB cell with respect to the transistor sizing.
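For reference, the supply-voltage condition of Section II-B can be checked numerically against the design point quoted there ($V_{\rm DD}=1$ V, $l\approx 3$, $k\approx 1.6$, $V_{\rm tn}=0.35$ V, $V_{\rm DD\_L}=0.3$ V). The Python sketch below simply evaluates (5), (6), and (7) as written; the function names are hypothetical and the 0.4 V case is an illustrative value that does not come from the paper.

```python
def switching_threshold(v_dd: float, k: float, l: float) -> float:
    """Inverter switching threshold V_X, following (5)."""
    return k * v_dd / (l + k)


def early_charge_level(v_dd: float, v_dd_l: float, v_tn: float) -> float:
    """Voltage reached at Q.F when only one input is valid, following (6)."""
    if v_dd <= v_dd_l:
        return v_dd - v_tn
    if v_dd - v_dd_l <= v_tn:
        return v_dd_l - v_tn
    return v_dd_l


def satisfies_condition_7(v_dd, v_dd_l, v_tn, k, l) -> bool:
    """True when V_Q.F stays below V_X, i.e. no early output transition."""
    return early_charge_level(v_dd, v_dd_l, v_tn) < switching_threshold(v_dd, k, l)


# Design point quoted in Section II-B.
v_x = switching_threshold(v_dd=1.0, k=1.6, l=3.0)          # about 0.348 V
v_qf = early_charge_level(v_dd=1.0, v_dd_l=0.3, v_tn=0.35)  # third case of (6)
design_ok = satisfies_condition_7(1.0, 0.3, 0.35, 1.6, 3.0)

# Illustrative counter-example (assumed value): raising V_DD_L to 0.4 V
# pushes V_Q.F above V_X and violates (7).
raised_ok = satisfies_condition_7(1.0, 0.4, 0.35, 1.6, 3.0)
```

With the quoted values the margin between $V_{Q.F}$ and $V_X$ is roughly 0.05 V, which is the headroom available to absorb variations in $V_{\rm DD}$, $V_{\rm tn}$, and l.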
For the SAHB with $1\times$ transistor sizing ($1\times$ along the abscissa of Fig. 7, with the pMOS and nMOS transistors at the aforesaid $1\times$ sizing), the normalized reading of 1 along the ordinate corresponds to a delay of 294 ps, a power dissipation of 7.12 $\mu$W, and a diffusion area of 23 $\mu$m<sup>2</sup>. For the SAHB with $6\times$ transistor sizing (pMOS and nMOS transistors whose relative sizing is $6\times$), the corresponding delay, power dissipation, and diffusion area are $0.7\times$, $1.95\times$, and $3.15\times$ those of the SAHB with $1\times$ transistor sizing. To mitigate the power and area overheads, we adopt $1\times$ transistor sizing for the pMOS and nMOS transistors in the critical path. For completeness, the other pMOS and nMOS transistors, which are not in the critical path [in Fig. 6(b) and (c)], are sized to the minimum sizing of 205/60 nm and 135/60 nm, respectively.

Fig. 7. Normalized parameters of an SAHB cell at different transistor sizings. The $1\times$ delay is 294 ps, the $1\times$ power dissipation is 7.12 $\mu$W, and the $1\times$ diffusion area is 23 $\mu$m<sup>2</sup>, at $V_{\rm DD} = 1$ V and $V_{\rm DD\_L} = 0.3$ V at 1 GHz.

Fig. 8(a) depicts the layout view of the SAHB cell, whose total area is 5 $\mu$m $\times$ 4.6 $\mu$m. We implement our SAHB library cells based on the fixed-height standard cell approach [26]. The height of the cells is fixed at 5 $\mu$m and their width is a multiple of 0.2 $\mu$m (depending on their complexity) with a minimum width of 1.4 $\mu$m. Fig. 8(b) depicts some geometry distances/rules such that our SAHB library cells can be placed together without violating any design rules. At both edges, the widths of both the pMOS guard ring (n+) and nMOS guard ring (p+) are 0.355 $\mu$m. The widths of the n-well (for pMOS transistors) and p-substrate (for nMOS transistors) are 2.52 and 1.77 $\mu$m, respectively.
The widths of both the supply rail $V_{\rm DD}$ and ground (gnd) are 0.56 $\mu$m, and the width of the supply rail $V_{\rm DD\_L}$ is 0.31 $\mu$m. All SAHB cells are verified by a Cadence abstract generator and their library exchange format (LEF) file is generated for the auto place-and-route process.

#### D. Benchmarking

Table I tabulates several characteristics of a buffer cell embodying various async cell design approaches; the buffer cell is the *de facto* circuit for analysis (although the results may vary for other cells). Consider first the overall perspective of the various approaches as a preamble to the interpretation of the benchmarking. The PCHB and SAHB cells are fully QDI compliant, thereby not requiring any timing assumptions, and hence feature excellent robustness. The PS0, LP2/1, SAPTL, STAPL, and STFB buffer cells, on the other hand, require delay assumptions [9], [11] for their implementation/operation. Their robustness is hence somewhat compromised.

From Table I, as expected, the 2-phase $(2\phi)$ handshaking protocol buffer cells, STAPL and STFB, feature fast cycle time and good static slack.
The cycle time is defined as the number of switching transitions to complete one cycle in a three-stage pipeline ring. The static slack is the maximum token occupancy in one pipeline stage during the operation; the $2\phi$ and $4\phi$ buffer cells have 100% (full-buffer) and 50% (half-buffer) occupancies, respectively. The STFB buffer cell has the best cycle time and the PS0 buffer cell has the least transistor count.

TABLE I. GENERAL CHARACTERISTICS OF A BUFFER CELL EMBODYING VARIOUS ASYNC CELL DESIGN APPROACHES

| Characteristics | SAHB | PCHB [15] | PS0 [16] | LP2/1 [17] | SAPTL [20] | STAPL [18] | STFB [19] |
|---|---|---|---|---|---|---|---|
| Template | QDI | QDI | Timed-pipeline | Timed-pipeline | Timed-pipeline | Single-track | Single-track |
| Logic family implementation | Static | Dynamic | Dynamic | Dynamic | Pass | Dynamic | Dynamic |
| Robustness (timing) | Excellent | Excellent | Good | Good | Good | Good | Good |
| Handshake | $4\phi$ | $4\phi$ | $4\phi$ | $4\phi$ | $4\phi$ | $2\phi$ | $2\phi$ |
| Cycle time (transitions) | 12 | 14 | 10 | 10 | 18 | 8 | 6 |
| Forward latency (transitions) | 2 | 2 | 2 | 2 | 4 | 2 | 2 |
| Area (transistors) | 34 | 44 | 22 | 33 | 52 | 32 | 31 |
| Static slack (%) | 50 | 50 | 50 | 50 | 50 | 100 | 100 |

Fig. 8. Layout view of the SAHB cell. (a) Various subblocks. (b) Geometry template.

In view of our intended power management application with FDVS, we will now focus on the fully QDI cell templates: our proposed SAHB and the reported PCHB. To appreciate their different circuit realizations, the buffer cells embodying our SAHB and the reported PCHB are depicted in Figs. 3 and 9, respectively. Of particular interest, the SA block in our SAHB cell (see Fig. 3) integrates both the input and output detection circuits (instead of these being separate entities as in the PCHB cell), and the evaluation block and SA block are tightly coupled together. Further, the complementary signals (nQ.T and nQ.F) are used both internally and for the external signal interface in the SAHB cell, while in the PCHB cell, the implicit internal complementary signals (S.T and S.F) are switched internally only. As a result, our SAHB cell features not only fewer transistors but also fewer switching nodes, and is hence more area efficient and power efficient than the reported PCHB. Further, as the SA block incorporates the integrated SA circuitry, which speeds up the data propagation in our SAHB cell, the SAHB cell is more speed efficient (with a shorter cycle time) than its PCHB counterpart. From Table I, both our SAHB and the PCHB cells have the same forward latency and static slack.

Fig. 9. Circuit schematic of a buffer cell embodying the reported PCHB. To optimize the PCHB cell, the noninverting C-Muller with three inverters could be replaced by the inverting C-Muller with two inverters. The optimized PCHB cell is $\sim$7% faster than the less optimized counterpart.

On the basis of simulations, Table II benchmarks the power dissipation, delay, power $\times$ delay product, power $\times$ delay<sup>2</sup> product, and IC area of six library cells embodying our SAHB and the competing PCHB cell design approaches. For ease of interpretation, the attributes of the different PCHB library cells are normalized with respect to their corresponding SAHB library cells, whose actual values are shown within parentheses. The average attributes of the six library cells are tabulated in the last row of Table II.

TABLE II. PARAMETERS OF VARIOUS LIBRARY CELLS EMBODYING THE SAHB AND PCHB CELL DESIGN APPROACHES

| No. | Library cell | Power ($\mu$W) @ 1 V, 1 GHz: SAHB | PCHB | Delay $2t_F + 2t_B$ (ps) @ 1 V: SAHB | PCHB | Power × Delay ($10^{-12}$ J): SAHB | PCHB | Power × Delay<sup>2</sup> ($10^{-21}$ Js): SAHB | PCHB | IC area ($\mu$m × $\mu$m): SAHB | PCHB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1-input Buffer | 1× (7.1) | 3.37× | 1× (294) | 1.38× | 1× (2.08) | 4.65× | 1× (612) | 6.42× | 1× (5×4.6) | 1.09× |
| 2 | 2-input AND/NAND | 1× (11.4) | 2.74× | 1× (392) | 1.36× | 1× (4.46) | 3.72× | 1× (1752) | 5.06× | 1× (5×5.4) | 1.07× |
| 3 | 2-input OR/NOR | 1× (11.1) | 2.82× | 1× (380) | 1.40× | 1× (4.68) | 3.95× | 1× (1776) | 5.53× | 1× (5×5.4) | 1.07× |
| 4 | 2-input XOR/XNOR | 1× (12.1) | 2.61× | 1× (488) | 1.15× | 1× (5.90) | 3.00× | 1× (2880) | 3.45× | 1× (5×6.6) | 1.06× |
| 5 | 2-input MUX/IMUX | 1× (14.1) | 2.61× | 1× (544) | 1.13× | 1× (7.68) | 2.95× | 1× (4172) | 3.33× | 1× (5×6.6) | 0.97× |
| 6 | 3-input AO/AOI | 1× (13.8) | 2.65× | 1× (490) | 1.22× | 1× (6.76) | 3.23× | 1× (3312) | 4.45× | 1× (5×7.8) | 1.08× |
| | Average | 1× (11.6) | 2.80× | 1× (432) | 1.27× | 1× (5.26) | 3.58× | 1× (2440) | 4.71× | 1× (5×6.1) | 1.06× |

Fig. 10. Simplified architecture of a 64-bit KS adder embodying the SAHB cell design approach.

It is apparent from Table II that the library cells embodying the PCHB, on average, dissipate $2.8\times$ higher power and are $1.27\times$ slower than those embodying the SAHB.
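The averages in the last row of Table II, and their relation to percentage savings, can be reproduced with a few lines of arithmetic. The Python sketch below uses the per-cell PCHB ratios from Table II; the conversion from a ratio r to a saving of 1 - 1/r is an assumption about how a "PCHB is r times larger" figure maps to a "SAHB is x% lower" figure, and the variable names are hypothetical.

```python
def percent_lower(ratio: float) -> float:
    """If the competitor is `ratio` times larger, the reference is this
    many percent lower (assumed conversion: 1 - 1/ratio)."""
    return 100.0 * (1.0 - 1.0 / ratio)


# PCHB values normalized to SAHB, copied from Table II (cells 1 to 6).
pchb_over_sahb = {
    "power": [3.37, 2.74, 2.82, 2.61, 2.61, 2.65],
    "delay": [1.38, 1.36, 1.40, 1.15, 1.13, 1.22],
    "area":  [1.09, 1.07, 1.07, 1.06, 0.97, 1.08],
}

# Arithmetic mean of the six ratios, as in the "Average" row.
averages = {name: sum(vals) / len(vals) for name, vals in pchb_over_sahb.items()}

# Corresponding SAHB savings in percent.
savings = {name: percent_lower(avg) for name, avg in averages.items()}
```

Under this conversion the average ratios of about 2.80, 1.27, and 1.06 correspond to SAHB savings of roughly 64%, 21%, and 5 to 6% in power, delay, and area.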
Consequently, in terms of the power $\times$ delay and power $\times$ delay<sup>2</sup> products, the library cells embodying the PCHB are uncompetitive: on average, they are $3.58\times$ and $4.71\times$ worse, respectively, than those embodying the SAHB. In terms of IC area, the library cells embodying the PCHB and the SAHB are largely comparable; on average, the PCHB library cells occupy $1.06\times$ larger IC area. In short, the library cells embodying the SAHB are simultaneously superior to those embodying the PCHB in terms of power, delay, and IC area.

### III. 64-bit SAHB KOGGE-STONE ADDER

Consider now the benchmarking for a larger circuit. This section describes the implementation of a 64-bit KS pipeline adder embodying the SAHB cell design approach and benchmarked against the PCHB.

#### A. 64-bit KS Pipeline Adder (SAHB Adder)

Fig. 10 depicts a simplified architecture of the KS pipeline adder embodying the SAHB cell design approach. Table III tabulates the realization of the SAHB pipeline blocks in the group propagate-generate (PG) logic in terms of symbol view, cell view, and SAHB design view. The primary input operands of the adder are $A = A_{63} \cdots A_0$, $B = B_{63} \cdots B_0$, and carry-in $C_{\text{in}}$. The primary output operands are $S = S_{63} \cdots S_0$ and carry-out $C_{\text{out}}$. For the sake of illustration, the async handshake signals (and their complementary signals) are not shown. The SAHB adder (consisting of a bitwise PG logic, a group PG logic, and a sum logic) is constructed as a multilevel carry look-ahead tree so that the carry propagation time is shortened, thereby increasing speed [27]. Overall, eight pipeline stages are required in the KS adder, resulting in a (forward) latency of eight pipeline delays and a throughput rate (the inverse of the cycle time) determined by the cycle time of one pipeline stage.
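The stage count above can be illustrated with a purely functional model of a Kogge-Stone prefix adder. The sketch below is not the SAHB netlist: it ignores dual-rail coding and handshaking, the function name `kogge_stone_add` is hypothetical, and folding the carry-in into the bit-0 generate signal is our own simplifying assumption. It uses the standard group PG recurrences $G(i:j) = G(i:k) + P(i:k) \cdot G(k-1:j)$ and $P(i:j) = P(i:k) \cdot P(k-1:j)$.

```python
def kogge_stone_add(a, b, cin=0, width=64):
    """Add two unsigned integers with a Kogge-Stone parallel-prefix tree.

    Returns (sum, carry_out, stage_count), where stage_count counts the
    bitwise PG level, each group PG level, and the final sum level.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    # Level 1: bitwise PG logic (generate = AND, propagate = XOR).
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    half_sum = p[:]          # bitwise propagate is reused by the sum logic
    g[0] |= p[0] & cin       # assumption: carry-in folded into bit 0
    stages = 1

    # Group PG logic: log2(width) levels with span 1, 2, 4, ...
    dist = 1
    while dist < width:
        new_g, new_p = g[:], p[:]
        for i in range(dist, width):
            new_g[i] = g[i] | (p[i] & g[i - dist])   # AO function
            new_p[i] = p[i] & p[i - dist]            # AND function
        g, p = new_g, new_p
        dist *= 2
        stages += 1

    # Sum logic: S_i = P_i XOR carry into bit i.
    carries = [cin] + g[:-1]
    total = 0
    for i in range(width):
        total |= (half_sum[i] ^ carries[i]) << i
    stages += 1
    return total, g[-1], stages


s, cout, stages = kogge_stone_add(0xFFFFFFFFFFFFFFFF, 1)
print(hex(s), cout, stages)
```

For a 64-bit operand width, the model uses one bitwise PG level, six group PG levels, and one sum level, which is consistent with the eight pipeline stages stated above.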
The analytical equations to compute the various propagate signals $P(i)_n$, $P(i:k)_n$, and $P(k-1)_n$ and the various generate signals $G(i)_n$, $G(i:k)_n$, and $G(k-1)_n$ at pipeline stage $n$ are reported in [27] and [43].

TABLE III
REALIZATION OF SAHB PIPELINE BLOCKS IN THE GROUP PG LOGIC

| No. | Symbol View (inputs → outputs) | Cell View | SAHB Design View |
|-----|-----|-----|-----|
| 1 | $G(i:j)_{n-1} \rightarrow G(i:j)_n$ | Buffer | SAHB buffer from $G(i:j)_{n-1}$ to $G(i:j)_n$, with handshake $Ack_{n-1}$ |
| 2 | $\{G(i:k)_{n-1}, P(i:k)_{n-1}\}$, $G(k-1:j)_{n-1} \rightarrow G(i:j)_n$ | AO/AOI | SAHB AO/AOI with handshakes $Ack_{n-1}$ and $Ack_n$ |
| 3 | $\{G(i:k)_{n-1}, P(i:k)_{n-1}\}$, $\{G(k-1:j)_{n-1}, P(k-1:j)_{n-1}\} \rightarrow \{G(i:j)_n, P(i:j)_n\}$ | AO/AOI and AND/NAND | SAHB AO/AOI and SAHB AND/NAND; a C-Muller cell (C) joins $Ack_{n-1}$; handshake $Ack_n$ |

In Fig. 10, four SAHB library cells, i.e., buffer, AND/NAND, XOR/XNOR, and AO/AOI cells, are used and their schematics were depicted earlier in Figs. 3 and 4. Other single-rail library cells (e.g., C-Muller cells) are also used.
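The C-Muller cell mentioned above, which the last row of Table III uses to join acknowledge signals, has a simple behavioral description. The sketch below is an illustrative model only: the class name `CElement` is hypothetical, a noninverting two-input element is assumed, and logic 1 is assumed to mean "asserted".

```python
class CElement:
    """Behavioral model of a noninverting 2-input Muller C-element.

    The output rises only when both inputs are high, falls only when both
    inputs are low, and otherwise holds its previous value.
    """

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:           # inputs agree: output follows them
            self.out = a
        return self.out      # inputs disagree: output is held


# Join the acknowledge signals of two parallel pipeline cells.
join = CElement()
trace = []
for ack_a, ack_b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]:
    trace.append(join.update(ack_a, ack_b))
print(trace)  # [0, 0, 1, 1, 0]
```

In this model, the joined acknowledge is asserted only after both cells have acknowledged and is deasserted only after both have withdrawn their acknowledge, which is the behavior a 4-phase handshake needs when one input is shared by two cells.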
The handshake signal to the preceding pipeline stage $(Ack_{n-1})$ is asserted when the SAHB pipeline cells have evaluated their outputs and is deasserted when the SAHB pipeline cells have reset their outputs to empty. The handshake signal from the succeeding pipeline stage $(Ack_n)$ indicates whether the outputs $G(i:k)_n$ and/or $P(i:k)_n$ are accepted by the next connecting SAHB pipeline cells. A C-Muller cell is used to join the $Ack_{n-1}$ signals generated from two parallel SAHB pipeline cells if the same input $P(i:k)_{n-1}$ is accepted by different pipeline cells (see the last row in Table III). The handshake connections for the other SAHB pipeline cells are constructed similarly.

#### B. Chip Implementation and Verification

The KS SAHB adder IC is fabricated using an STMicroelectronics 65-nm CMOS general-purpose standard threshold voltage process whose $V_{\rm tn}=0.35$ V and $V_{\rm tp}=-0.35$ V at $V_{\rm DD}=1$ V. Fig. 11(a) and (b), respectively, depict the microphotograph and the layout of the SAHB adder with the test structure.

Fig. 11. 64-bit SAHB KS pipeline adder. (a) Microphotograph. (b) Layout view.

We adopt a hybrid full-/semicustom approach to design the SAHB adder. The layouts of all the SAHB library cells and standard (single-rail) library cells, including C-Muller cells, were first handcrafted using the Cadence layout tool. With the extracted LEF files of the library cells and the Verilog files of the SAHB adder, the layout of the SAHB adder was thereafter placed and routed using the Cadence First Encounter tool. The SAHB adder netlist was finally simulated/verified using the Synopsys Nanosim tool, and prototype ICs were physically tested/measured. The core area of the KS SAHB adder is 306 $\mu$m $\times$ 209 $\mu$m. Although our SAHB cells appear to be complex, the implementation issues in this complex adder design are manageable. This is because our SAHB cells are taller, allowing up to 16 parallel metal traces to be routed over a cell, and because we used the Cadence First Encounter tool to optimize the routing based on the eight-metal 65-nm CMOS process. The final design has, excluding the interconnections, an 80% core utilization.

Fig. 12. Signal waveforms of the SAHB adder operations. (a) Subthreshold ($V_{\rm DD}\sim0.25$ V). (b) FDVS ($V_{\rm DD}$ from 1.4 to 0.3 V).

All 20 KS SAHB adder prototype ICs were measured and were found to be fully functional. Of these 20 ICs, 5 and 15 ICs are functional for $V_{\rm DD} \geq 0.25$ V and $V_{\rm DD} \geq 0.3$ V, respectively. It is interesting to note that our design at subthreshold voltage features higher speed operation compared with some reported subthreshold designs. For example, based on the same 65-nm CMOS process, a recently reported 32-bit subthreshold KS adder [44] operates at 3 MHz at $V_{\rm DD} = 300$ mV, whereas our SAHB KS 64-bit adder design operates at a higher speed of 3.76 MHz for the same $V_{\rm DD}$. Fig. 12(a) depicts the $V_{DD}$ (0.25 V) and output time-domain waveforms for one of the above-said five KS SAHB ICs.

As these ICs were fully functional for $V_{DD}$ ranging from the subthreshold voltage (0.3 V), through the near-threshold voltage, to the nominal voltage (1.0 V), our SAHB approach is applicable for FDVS [28]. By comparison, the reported QDI and TP designs (PCHB, PS0, etc.) would likely be applicable only to half-range dynamic voltage scaling (i.e., from the near-threshold voltage to the nominal voltage). This is because these reported designs adopt a dynamic logic style, where the cross-coupled inverters (in the IL) are not functionally robust in the subthreshold voltage regime.

Consider now the operational robustness of our SAHB adder against $V_{\rm DD}$ variation for an *in situ* self-adaptive $V_{\rm DD}$ system [14], where $V_{\rm DD}$ is automatically adjusted such that the minimum $V_{\rm DD}$ is applied; the intention is the lowest power operation for the prevailing condition.
The top and bottom traces of Fig. 12(b), respectively, depict the real-time varying $V_{\rm DD}$ (from 1.4 to 0.3 V) and the generated output. It can be seen that even when $V_{\rm DD}$ is varied widely, the operation is uninterrupted and error free. On this basis, circuits embodying our SAHB cell design approach are advantageous for power/speed tradeoff through voltage scaling with low transition/recovery time [29].

Fig. 13. Normalized figures of merit. (a) Energy per operation $(E_{per})$. (b) Test speed (1/t). (c) $E_{per} \cdot t$. (d) $E_{per} \cdot t^2$. All are normalized to the readings taken at 1 V, 27 °C.

#### C. Results and Benchmarking

Fig. 13 depicts the normalized energy per operation ($E_{\rm per}$), test speed (1/t), $E_{\rm per} \cdot t$, and $E_{\rm per} \cdot t^2$ of a typical prototype chip for different supply voltages ($V_{\rm DD}=0.3$ to 1.4 V and $V_{\rm DD\_L}=0.3$ V) and at different temperatures (-40 °C, 0 °C, 27 °C, and 100 °C). The results are normalized with respect to the readings taken at 1 V, 27 °C, where $E_{\rm per}=76.5$ pJ, 1/t=125 MHz, $E_{\rm per} \cdot t=610 \times 10^{-21}$ J · s, and $E_{\rm per} \cdot t^2=4.86 \times 10^{-27}$ J · s². The test speed includes the test structure circuit overheads (for loading/synchronizing inputs) incurred to test the operations at different temperatures/voltages. The actual throughput of the SAHB adder is expected to be $10\times$ higher than the test speed. The test jig is placed into a temperature chamber (model: Binder MK53), and the chamber temperature is carefully controlled and kept stable for at least 1 h before measurement readings are taken. For more accurate temperature measurements, thermal sensors could be inserted into the silicon and placed in close proximity to the test circuits.

From Fig. 13(a), we remark the following for the $E_{\rm per}$ plot.
First, as expected, $E_{\rm per}$ reduces as $V_{\rm DD}$ reduces from 1.4 V down to the minimum $E_{\rm per}$ voltage point (i.e., $V_{\rm DD}=0.3$ V, in the subthreshold voltage regime). Second, from the subthreshold to nominal voltage regime (0.3 to 1.4 V), $E_{\rm per}$ increases when the temperature increases. This is as expected due to the increase in static power dissipation.

From Fig. 13(b), we remark the following for the 1/t plot. First, as expected, 1/t reduces as $V_{\rm DD}$ reduces. Second, for the near-threshold voltage to nominal voltage regime (0.7 to 1.4 V), 1/t reduces when the temperature increases. This is due to the lower electron mobility at higher temperature. However, in the subthreshold voltage to near-threshold voltage regime (0.3 to 0.7 V), 1/t conversely increases when the temperature increases. This is due to the subthreshold operation effects [30].

From Fig. 13(c), we remark the following for the $E_{\rm per} \cdot t$ plot. First, $E_{\rm per} \cdot t$ reduces as $V_{\rm DD}$ reduces from 1.4 V down to the minimum $E_{\rm per} \cdot t$ voltage point (i.e., $V_{\rm DD} = 0.6$ V, in the near-threshold voltage regime). Further reducing $V_{\rm DD}$ from that point causes $E_{\rm per} \cdot t$ to rise above its minimum. Second, the minimum $E_{\rm per} \cdot t$ decreases when the temperature decreases. Third, the $V_{\rm DD}$ for the minimum $E_{\rm per} \cdot t$ reduces when the temperature decreases. Fourth, in the near-threshold voltage to nominal voltage regime (0.5 to 1.4 V), $E_{\rm per} \cdot t$ increases when the temperature increases. However, in the subthreshold voltage to near-threshold voltage regime (0.3 to 0.5 V), $E_{\rm per} \cdot t$ conversely decreases when the temperature increases.

Finally, from Fig. 13(d), we remark the following for the $E_{\rm per} \cdot t^2$ plot.
First, $E_{\rm per} \cdot t^2$ reduces slightly and remains relatively constant in the near-threshold voltage to nominal voltage regime (0.6 to 1.4 V). However, it increases significantly in the subthreshold voltage regime (0.3 to 0.5 V); this is expected because $t^2$ increases significantly. Second, in the near-threshold voltage to nominal voltage regime (0.6 to 1.4 V), $E_{\rm per} \cdot t^2$ increases when the temperature increases. However, in the subthreshold voltage to near-threshold voltage regime (0.3 to 0.6 V), $E_{\rm per} \cdot t^2$ conversely decreases when the temperature increases.

For completeness of comparison, we further benchmark, on the basis of simulations, our SAHB KS adder against its PCHB and sync KS adder counterparts. The PCHB and sync KS adders are designed/simulated using the same process. The sync design is synthesized to its fastest speed. Fig. 14 depicts the normalized energy per operation versus the throughput for the three designs. For ease of comparison, the results are normalized to those of the SAHB adder at the 1-GHz throughput. The throughputs of the SAHB and PCHB designs are adjusted by means of voltage scaling from $V_{\rm DD}=1~\rm V$ to lower $V_{\rm DD}$ until they fail. On the other hand, the throughput of the sync design is adjusted through frequency scaling at $V_{\rm DD}=1~{\rm V}$, and voltage scaling is not considered due to the need for timing matching.

Fig. 14. Normalized energy per operation versus throughput of the SAHB, PCHB, and sync 64-bit KS pipeline adders, normalized to the readings of the SAHB adder at the 1-GHz throughput.

From Fig. 14, we remark the following. First, the sync design has the highest maximum throughput, up to 4 GHz (not shown), whereas the maximum throughputs of the SAHB and PCHB designs are 1.23 and 1.02 GHz, respectively.
This is somewhat expected as both the SAHB and PCHB designs abide by the QDI async protocols and require some delay (transition) overheads to acknowledge their operation sequence. Further, the transistors of the SAHB design (and the PCHB design) are not sized for maximum speed but for low power dissipation (see Section II-C). Further, the sync adder leverages timing assumptions associated with the clock, hence potentially featuring a higher speed. In the context of accommodating PVT variations, the SAHB and PCHB designs would feature extremely good operational robustness, while the sync design would not be error free if the clock timing is violated. In other words, in a practical design, the sync design would need a realistic timing margin and would hence operate slower.

Second, it is particularly interesting that the sync design is less energy efficient than the SAHB design. The higher energy dissipation of the sync design is due to the large number of registers used (for high-speed gate-level pipelining) and, to some extent, due to the energy dissipated in the high-speed clock buffers. For a somewhat more balanced comparison, it is instructive to note that although the sync design could be redesigned to embody a different architecture (e.g., having a block-level pipeline) to reduce the energy dissipation (but at the cost of a lower throughput), such analysis and other permutations are not considered here.

Third, at the fixed throughput rate of 1 GHz, the sync and PCHB designs dissipate 1.65× and 2.29× higher energy, respectively, than our SAHB design. Fourth, of the SAHB and PCHB designs, the SAHB design is more energy and speed efficient than the PCHB design. Further, the PCHB design is also less area efficient, requiring 1.31× the transistor count of our SAHB design, which translates to $\sim 7\%$ larger area.
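The "N× higher" ratios quoted above can be restated as the percentage savings of the SAHB adder, which is how the abstract reports them. The helper below is a small illustrative sketch; the function name `percent_lower` is hypothetical, and the value 1.31 is read as the PCHB-to-SAHB transistor count ratio.

```python
def percent_lower(ratio):
    """If design X costs `ratio` times design Y, return how many percent
    lower Y is relative to X."""
    return (1.0 - 1.0 / ratio) * 100.0


# Ratios relative to the SAHB adder, taken from the text above.
sync_energy_ratio = 1.65       # sync energy / SAHB energy at 1 GHz
pchb_energy_ratio = 2.29       # PCHB energy / SAHB energy at 1 GHz
pchb_transistor_ratio = 1.31   # PCHB transistors / SAHB transistors

print(round(percent_lower(sync_energy_ratio)))      # 39
print(round(percent_lower(pchb_energy_ratio)))      # 56
print(round(percent_lower(pchb_transistor_ratio)))  # 24
```

These values agree with the approximately 39% and 56% lower energy and the approximately 24% lower transistor count stated in the abstract.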
TABLE IV
COMPARISON OF VARIOUS 64-bit ADDERS

| 64-Bit Adder | SAHB<sup>&amp;</sup> | PCHB<sup>&amp;</sup> | Ferretti [19] | Sync\* | Stasiak<sup>&amp;</sup> [31] | Lee [32] | Huang [33] | Zlatanovici [34] | Frustaci<sup>&amp;</sup> [35] | Zeydel<sup>&amp;</sup> [27] | Kim [36] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Design Approach | Async | Async | Async | Sync | Sync | Sync | Sync | Sync | Sync | Sync | Sync |
| CMOS (nm) | 65 | 65 | 250 | 65 | 180 | 250 | 250 | 90 | 90 | 65 | 180 |
| $V_{\rm DD}$ (V) | 1.0 | 1.0 | 2.5 | 1.0 | 1.5 | 2.5 | 2.5 | 1.0 | 1.0 | 1.1 | 1.0 |
| Pipeline Structure | P<sup>+</sup> | P<sup>+</sup> | P<sup>+</sup> | P<sup>+</sup> | NP\* | P<sup>+</sup> | NP\* | NP\* | NP\* | NP\* | NP\* |
| Algorithm | KS CLA<sup>Δ</sup> | KS CLA<sup>Δ</sup> | CLA<sup>Δ</sup> | CLA<sup>Δ</sup> | CS<sup>^</sup>/CLA<sup>Δ</sup> | CLA<sup>Δ</sup> | RC<sup>□</sup> | CLA<sup>Δ</sup> | CLA<sup>Δ</sup> | CLA<sup>Δ</sup> | CS<sup>^</sup> |
| Logic Coding | QDI & DR<sup>α</sup> | QDI & DR<sup>α</sup> | ST & DR<sup>α</sup> | SR<sup>#</sup> | SR<sup>#</sup> | SR<sup>#</sup> | SR<sup>#</sup> | SR<sup>#</sup> | SR<sup>#</sup> | SR<sup>#</sup> | SR<sup>#</sup> |
| Logic Family | SAHB logic | PCHB logic | STFB logic | Static logic | Domino logic | Race logic | Domino logic | Domino logic | Dynamic/domino logic | Domino logic | Boosted differential logic |
| PVT Immunity | Excellent | Excellent | Good | Poor | Poor | Poor | Poor | Poor | Poor | Poor | Poor |
| Energy, $E$ (fJ) | 32.7 | 57.8 | 190 | 78 | NA | 97.2 | NA | 62.4 | 4.3 | 6.8 | 152 |
| Throughput, 1/t (GHz) | 1.23 | 1.02 | 1.45 | 4.00 | 2.27 | 1.10 | 1.28 | 4.16 | 1.29 | 5.26 | 2.50 |
| Latency (ns) | 0.98 | 1.19 | 1.80 | 2.00 | 0.44 | 0.90 | 0.64 | 0.24 | 0.39 | 0.19 | 0.40 |
| Area (mm<sup>2</sup>) | 0.06 | 0.07 | 0.96 | NA | 0.36 | 0.12 | 0.04 | 0.03 | 0.01 | NA | 0.54 |
| $E \cdot t$ ($\times 10^{-23}$ J · s) | 2.7 | 5.7 | 13.1 | 2.0 | NA | 8.8 | NA | 1.5 | 0.33 | 0.13 | 6.1 |

For completeness, Table IV tabulates a comparison of several reported 64-bit adders. Although the comparison is somewhat contentious due to the large variations in the designs, architectures, pipelining, parameters, and so on, it is nonetheless worthwhile to note that our SAHB adder is robust, insensitive to PVT variations, and energy efficient.

#### IV. CONCLUSION

We have proposed a novel SAHB realization approach with emphases on high operational robustness, high speed, and low energy dissipation. These attributes are collectively achieved by several novel circuit designs/operations, including QDI-compliant operation, a cross-coupled latch with a positive feedback mechanism in the SA block, reduced switching nodes in the evaluation and SA blocks, minimum sizing of the nMOS pull-up network in the evaluation block, and static logic operation. The basic library cells embodying SAHB have been shown to feature higher speed, lower energy dissipation, and lower transistor count (and smaller IC area) than those embodying the reported competing PCHB. The 64-bit SAHB adder has been prototyped for a power management application with FDVS. To demonstrate its energy efficiency, the proposed SAHB adder has been benchmarked against its competing PCHB and sync equivalents, where its advantageous attributes have been demonstrated.

#### REFERENCES

- [1] D. Rossi, C. Mucci, M. Pizzotti, L. Perugini, R. Canegallo, and R. Guerrieri, "Multicore signal processing platform with heterogeneous configurable hardware accelerators," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 22, no. 9, pp. 1990–2003, Sep. 2014. - [2] D. N. Truong *et al.*, "A 167-processor computational platform in 65 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 44, no. 4, pp. 1130–1144, Apr. 2009. - [3] G. Jiang, Z. Li, F. Wang, and S. Wei, "A low-latency and low-power hybrid scheme for on-chip networks," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 23, no. 4, pp. 664–677, Apr. 2015. - [4] S. M. Nowick and M.
Singh, "Asynchronous design—Part 1: Overview and recent advances," *IEEE Des. Test*, vol. 32, no. 3, pp. 5–18, Jun. 2015. - [5] K.-S. Chong, K.-L. Chang, B.-H. Gwee, and J. S. Chang, "Synchronous-logic and globally-asynchronous-locally-synchronous (GALS) acoustic digital signal processors," *IEEE J. Solid-State Circuits*, vol. 47, no. 3, pp. 769–780, Mar. 2012. - [6] H. Zakaria and L. Fesquet, "Designing a process variability robust energy-efficient control for complex SOCs," *IEEE J. Emerg. Sel. Topics Circuits Syst.*, vol. 1, no. 2, pp. 160–172, Jun. 2011. - [7] International Technology Roadmap for Semiconductors (ITRS), Semicond. Ind. Assoc., Washington, DC, USA, 2013. - [8] S. M. Nowick and M. Singh, "Asynchronous design—Part 2: Systems and methodologies," *IEEE Des. Test*, vol. 32, no. 3, pp. 19–28, Jun. 2015. - [9] A. J. Martin, "The limitations to delay-insensitivity in asynchronous circuits," in *Proc. 6th MIT Conf. Adv. Res. VLSI*, 1990, pp. 263–278. - [10] R. Zhou, K.-S. Chong, B.-H. Gwee, and J. S. Chang, "A low overhead quasi-delay-insensitive (QDI) asynchronous data path synthesis based on microcell-interleaving genetic algorithm (MIGA)," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 33, no. 7, pp. 989–1002, Jul. 2014. - [11] A. J. Martin and M. Nystrom, "Asynchronous techniques for systemon-chip design," *Proc. IEEE*, vol. 94, no. 6, pp. 1089–1120, Jun. 2006. - [12] J. Sparsø, J. Staunstrup, and M. Dantzer-Sørensen, "Design of delay insensitive circuits using multi-ring structures," in *Proc. Eur. Design Autom. Conf.*, 1992, pp. 7–10. - [13] R. D. Jorgenson *et al.*, "Ultralow-power operation in subthreshold regimes applying clockless logic," *Proc. IEEE*, vol. 98, no. 2, pp. 299–314, Feb. 2010. - [14] T. Lin, K.-S. Chong, J. S. Chang, and B.-H. Gwee, "An ultra-low power asynchronous-logic in-situ self-adaptive V<sub>DD</sub> system for wireless sensor networks," *IEEE J. Solid-State Circuits*, vol. 48, no. 2, pp. 573–586, Feb. 2013. 
- [15] A. J. Martin *et al.*, "The design of an asynchronous MIPS R3000 microprocessor," in *Proc. 17th Conf. Adv. Res. VLSI*, Sep. 1997, pp. 164–181. - [16] T. E. Williams and M. A. Horowitz, "A zero-overhead self-timed 160-ns 54-b CMOS divider," *IEEE J. Solid-State Circuits*, vol. 26, no. 11, pp. 1651–1661, Nov. 1991. - [17] M. Singh and S. M. Nowick, "The design of high-performance dynamic asynchronous pipelines: Lookahead style," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 15, no. 11, pp. 1256–1269, Nov. 2007. - [18] M. Nystrom, E. Ou, and A. J. Martin, "An eight-bit divider implemented in asynchronous pulse logic," in *Proc. IEEE Int. Symp. Asynchronous Circuits Syst. (ASYNC)*, Apr. 2004, pp. 229–239.

Notes to Table IV: <sup>+</sup>P: pipeline; <sup>\*</sup>NP: nonpipeline; <sup>^</sup>CS: carry-select; <sup>Δ</sup>CLA: carry look-ahead; <sup>□</sup>RC: ripple carry; <sup>#</sup>SR: single-rail; <sup>α</sup>DR: dual-rail; <sup>&amp;</sup>: based on simulation results.

- [19] M. Ferretti and P. A. Beerel, "High performance asynchronous design using single-track full-buffer standard cells," *IEEE J. Solid-State Circuits*, vol. 41, no. 6, pp. 1444–1454, Jun. 2006. - [20] T.-T. Liu, L. P. Alarcon, M. D. Pierson, and J. M. Rabaey, "Asynchronous computing in sense amplifier-based pass transistor logic," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 17, no. 7, pp. 883–892, Jul. 2009. - [21] Z. Xia, M. Hariyama, and M. Kameyama, "Asynchronous domino logic pipeline design based on constructed critical data path," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 23, no. 4, pp. 619–630, Apr. 2015. - [22] P. Golani and P. A. Beerel, "Area-efficient asynchronous multilevel single-track pipeline template," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 22, no. 4, pp. 838–849, Apr. 2014. - [23] M.-C. Chang and W.-H. Chang, "Asynchronous fine-grain power-gated logic," *IEEE Trans. Very Large Scale Integr.
(VLSI) Syst.*, vol. 21, no. 6, pp. 1143–1153, Jun. 2013. - [24] W.-H. Ma, J. C. Kao, V. S. Sathe, and M. C. Papaefthymiou, "187 MHz subthreshold-supply charge-recovery FIR," *IEEE J. Solid-State Circuits*, vol. 45, no. 4, pp. 793–803, Apr. 2010. - [25] K.-L. Chang, B.-H. Gwee, J. S. Chang, and K.-S. Chong, "Synchronous-logic and asynchronous-logic 8051 microcontroller cores for realizing the Internet of Things: A comparative study on dynamic voltage scaling and variation effects," *IEEE J. Emerg. Sel. Topics Circuits Syst.*, vol. 3, no. 1, pp. 23–34, Mar. 2013. - [26] R. J. Baker, CMOS: Circuit Design, Layout, and Simulation. New York, NY, USA: Wiley, 2011. - [27] B. R. Zeydel, D. Baran, and V. G. Oklobdzija, "Energy-efficient design methodologies: High-performance VLSI adders," *IEEE J. Solid-State Circuits*, vol. 45, no. 6, pp. 1220–1233, Jun. 2010. - [28] J. S. Chang, B. H. Gwee, and K. S. Chong, "Asynchronous-logic circuit for full dynamic voltage control," U.S. Patent 8791717 B2, Jul. 29, 2014. - [29] J. S. Chang, B. H. Gwee, and K. S. Chong, "Digital cell," U.S. Patent 8994406 B2, Mar. 31, 2015. - [30] A. Raychowdhury, B. C. Paul, S. Bhunia, and K. Roy, "Computing with subthreshold leakage: Device/circuit/architecture co-design for ultralow-power subthreshold operation," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 13, no. 11, pp. 1213–1224, Nov. 2005. - [31] D. L. Stasiak, F. Mounes-Toussi, and S. N. Storino, "A 440-ps 64-bit adder in 1.5-V/0.18-μm partially depleted SOI technology," *IEEE J. Solid-State Circuits*, vol. 36, no. 10, pp. 1546–1552, Oct. 2001. - [32] S.-J. Lee and H.-J. Yoo, "Race Logic Architecture (RALA): A novel logic concept using the race scheme of input variables," *IEEE J. Solid-State Circuits*, vol. 37, no. 2, pp. 191–201, Feb. 2002. - [33] C.-H. Huang, J.-S. Wang, C. Yeh, and C.-J. Fang, "The CMOS carry-forward adders," *IEEE J. Solid-State Circuits*, vol. 39, no. 2, pp. 327–336, Feb. 2004. - [34] R. Zlatanovici, S. 
Kao, and B. Nikolic, "Energy-delay optimization of 64-bit carry-lookahead adders with a 240 ps 90 nm CMOS design example," *IEEE J. Solid-State Circuits*, vol. 44, no. 2, pp. 569–583, Feb. 2009. - [35] F. Frustaci, M. Lanuzza, P. Zicari, S. Perri, and P. Corsonello, "Designing high-speed adders in power-constrained environments," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 56, no. 2, pp. 172–176, Feb. 2009. - [36] J.-W. Kim, J.-S. Kim, and B.-S. Kong, "Low-voltage CMOS differential logic style with supply voltage approaching device threshold," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 59, no. 3, pp. 173–177, Mar. 2012. - [37] J. Sparsø and S. Furber, Principles of Asynchronous Circuit Design: A Systems Perspective. Norwell, MA, USA: Kluwer, 2001. - [38] P. A. Beerel, R. O. Ozdag, and M. Ferretti, A Designer's Guide to Asynchronous VLSI. Cambridge, U.K.: Cambridge Univ. Press, 2010. - [39] S. M. Nowick and M. Singh, "High-performance asynchronous pipelines: An overview," *IEEE Des. Test*, vol. 28, no. 5, pp. 8–22, Sep./Oct. 2011. - [40] D. M. Chapiro, "Globally-asynchronous locally-synchronous systems," Ph.D. dissertation, Dept. Comput. Sci., Stanford Univ., Stanford, CA, USA, 1985. - [41] R. Manohar and A. J. Martin, "Quasi-delay-insensitive circuits are turing-complete," Dept. Comput. Sci., Comput. Sci. Tech., Caltech, CA, USA, Tech. Rep. CS-TR-95-11, Nov. 1995, pp. 1–14. - [42] A. M. Lines, "Pipelined asynchronous circuits," M.S. thesis, Dept. Comput. Sci., California Inst. Technol., Pasadena, CA, USA, 1995. - [43] N. H. E. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, 3rd ed. Reading, MA, USA: Addison-Wesley, 2005 - [44] H. Fuketa, M. Hashimoto, Y. Mitsuyama, and T. Onoye, "Adaptive performance compensation with *in-situ* timing error predictive sensors for subthreshold circuits," *IEEE Trans. Very Large Scale Integr. (VLSI)* Syst., vol. 20, no. 2, pp. 333–343, Feb. 2012. - [45] W.-G. Ho, K.-S. Chong, N. K. Z. 
Lwin, B.-H. Gwee, and J. S. Chang, "High robustness energy- and area-efficient dynamic-voltage-scaling 4-phase 4-rail asynchronous-logic network-on-chip (ANoC)," in *Proc. IEEE ISCAS*, Lisbon, Portugal, May 2015, pp. 1913–1916.

**Kwen-Siong Chong** (S'03–M'09–SM'13) received the B.Eng., M.Phil., and Ph.D. degrees from Nanyang Technological University (NTU), Singapore, in 2001, 2002, and 2007, respectively, all in electrical and electronic engineering. He was a Visiting Researcher with the Nara Institute of Science and Technology, Ikoma, Japan, in 2010, and the University of Michigan, Ann Arbor, MI, USA, in 2012. He was/is a Co-Principal Investigator/Collaborator of several research projects, including the projects from the Defense Advanced Research Projects Agency in the USA, the Ministry of Education Tier-2 in Singapore, and the Public Sector Research Funding in Singapore. He is currently a Senior Research Scientist with Temasek Laboratories @ NTU, Singapore. His current research interests include asynchronous VLSI designs, low-voltage low-power VLSI circuits, resilient circuits and systems, and audio signal processing. Dr. Chong has been a member of the IEEE Circuits and Systems (CAS) Society VLSI Systems and Applications Technical Committee since 2009. He was the Vice Chair of the IEEE CAS Society, Singapore Chapter, in 2013 and 2014. He served on the organizing committee for several conferences, including the ASP-DAC 2014 and DSP-2015.

**Weng-Geng Ho** (S'10) received the B.Eng. (Hons.) and Ph.D. degrees in electrical and electronic engineering from Nanyang Technological University (NTU), Singapore, in 2009 and 2016, respectively. He is currently a Research Scientist with the Hardware Assurance Team, Temasek Laboratories @ NTU, Singapore. His current research interests include low power secured memory design, digital VLSI design, asynchronous-logic circuit design, NoC-based multicore platform design, and side-channel-attack countermeasure. Dr.
Ho was a recipient of the NTU Graduate Research Scholarship.

**Tong Lin** (S'08–M'14) received the B.Eng. (Hons.) and Ph.D. degrees in electrical and electronic engineering from Nanyang Technological University (NTU), Singapore, in 2008 and 2014, respectively. He went for an exchange program with the University of Miami, Coral Gables, FL, USA, in 2006. He is currently a Research Scientist with Temasek Laboratories @ NTU, Singapore. His current research interests include asynchronous-logic circuit design, ultrarobust ultralow power circuit and system design, and fault-tolerant radiation-hardened circuit and system design. Dr. Lin was a recipient of a full undergraduate scholarship from the Ministry of Education, Singapore. He was also a recipient of the prestigious Nanyang President's Graduate Scholarship. He received the Best Student Paper Award at the IEEE Subthreshold Microelectronics Conference in 2012.

**Bah-Hwee Gwee** (S'93–M'97–SM'03) received the B.Eng. degree in electrical and electronic engineering from the University of Aberdeen, Aberdeen, U.K., in 1990, and the M.Eng. and Ph.D. degrees from Nanyang Technological University (NTU), Singapore, in 1992 and 1998, respectively. He was an Assistant Professor with the School of EEE, NTU, from 1999 to 2005, where he has been an Associate Professor since 2005. He held the concurrent appointment of Assistant Chair (Students) with the School of EEE, NTU, from 2010 to 2014. He was the Principal Investigator (PI) of a number of research projects, including the ASEAN-European Union University Network Program, the Ministry of Education Tier-1 and Tier-2, the Agency for Science, Technology and Research, the Defence Science and Technology Agency, and Temasek Laboratories projects. He was also the Co-PI of DARPA in USA, NTU-Panasonic, and NTU-Linköping research projects. His total research grants amount to U.S. \$10 M. He has three U.S. patents granted in circuit design.
His current research interests include sub-threshold, dynamic voltage scaling asynchronous circuit, GALS NoC, secured chip, Class-D amplifier, and dc-dc converter designs. Dr. Gwee was the Chairman of the IEEE Singapore Circuits and Systems Chapter in 2005, 2006, 2013, and 2016. He has been a member of the IEEE CAS Society DSP, VLSI, and Bio-CAS Technical Committees since 2004. He is the Chairman of the DSP Technical Committee. He served on the organizing committees of the IEEE BioCAS-2004, the IEEE APCCAS-2006, and the IEEE DSP-2015, as the Technical Program Chair of ISIC-2007, ISIC-2011, and ISIC-2016, and on the Steering Committee of the IEEE APCCAS 2006–2008. He was an Associate Editor of the *Circuits, Systems and Signal Processing* journal from 2007 to 2012, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II—EXPRESS BRIEFS from 2010 to 2011, and the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I—REGULAR PAPERS from 2012 to 2013. He was an IEEE Distinguished Lecturer of the CAS Society in 2009/2010.

**Joseph S. Chang** (M'04–SM'12) received the B.Eng. degree in ECE from Monash University, Melbourne, VIC, Australia, and the Ph.D. degree from the Department of Otolaryngology, University of Melbourne, Melbourne. He was the Associate Dean of Research and Graduate Studies with the College of Engineering, Nanyang Technological University (NTU), Singapore. He is currently with NTU as a Professor and the Director of the Virtus IC Design Center of Excellence. He is an Adjunct Professor with Texas A&M University, College Station, TX, USA. His research is highly multidisciplinary, and he has published prolifically, with over 200 science and engineering publications and over 30 awarded and pending patents, and has licensed technology to the industry. He has founded two startups, and has designed numerous related products adopted by the industry and commercially. Dr. Chang has received numerous academic, defense, and industrial grants exceeding U.S.
\$15 M, including from the Defense Advanced Research Projects Agency in the USA, the EU multinational corporations, and Singapore funding agencies. He has served as a Guest Editor of the PROCEEDINGS OF THE IEEE, a Corresponding Guest Editor of the IEEE JOURNAL EMERGING AND SELECTED TOPICS, a Guest Editor of the IEEE Circuits and Systems Magazine, and an Editor of the Open Column of the IEEE Circuits and Systems Magazine. He was an Associate Editor of several IEEE publications, including the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II, and the IEEE Circuits and Systems Magazine, and is a Senior Editor of the IEEE JOURNAL EMERGING AND SELECTED TOPICS. He founded the new Hybrid and Printed Electronics Technical Committee, is the Chair of the Analog Signal Processing Technical Committee, and was the Chair of the Life Sciences Systems and Applications Technical Committee and of the Biomedical and Life Science Circuits and Systems Technical Committee, all of the IEEE Circuits and Systems Society. He has chaired several international conferences, including the IEEE-National Institutes of Health (NIH) Life Sciences Systems and Applications Workshop, the IEEE-NIH CAS Medical and Environmental Workshops, and the International Symposium on Integrated Circuits and Systems. He was an IEEE Distinguished Lecturer from 2012 to 2013. He was the Keynote/Plenary Speaker at several major conferences, including the IEEE ICCAS'13, IBBN'14, the IEEE Async'14, and the IEEE ICECS'15, and will be a Keynote Speaker at the impending IEEE MWSCAS'16.