Optimization of Clock Tree Synthesis Under Stochastic Process Variation Modeling for Multi-FPGA Systems

In scientific computing, experimental modeling and evaluation are commonly carried out through simulation. In addition to optimality, the methods considered here capture the underlying uncertainty due to process variation. This is accomplished by adjusting the number of samples on delay and wire width. The work also addresses the thermal profile: as temperature gradually increases, the worst-case clock skew under thermal variation is reduced. Under SSTA analysis, the reported mean delay goes from 6.2 to 5.2% and the standard deviation from 7.5 to 7.6%. The overall cost in storage and run time is very low. Extensive simulation studies address how to post-process stochastic simulation fields accurately and efficiently and how to convey the results effectively and succinctly.


INTRODUCTION
Designing clock-distribution networks for high-speed chips now involves more than just meeting timing specifications. Achieving clock latency and clock skew targets is difficult when clock signals of 300 MHz or more traverse the chip. Because the clock network is one of the most power-hungry nets on a chip, it must also be designed with power dissipation in mind.
Using DME (Chao et al., 1992; Chang and Sapatnekar, 2003; Basir-Kazeruni et al., 2013; Liu et al., 2008; Chakraborty et al., 2006; Khul et al., 2010), procedures based on on-chip thermal gradients have been proposed to address thermal closure and the thermal impact on a set of standard benchmarks. Recent work covers zero-skew clock tree optimization, buffering and wire sizing in Application Specific Integrated Circuits (ASICs). A stochastic synthesis algorithm for routing uses Statistical Static Timing Analysis (SSTA) with statistical criticality for FPGAs (Lin et al., 2008). Such results have been used for trigger and data acquisition systems in high-energy physics experiments. Results of a jitter analysis have also been presented, performed by exploring several configurations of the DLLs and PLLs embedded in a Xilinx Virtex-5 FPGA.
Instead of analyzing the clock synchronization of a single FPGA, this work takes multi-FPGA system clock synchronization into account. However, no existing work addresses an interconnect uncertainty framework for mapping applications onto 3D reconfigurable devices, starting from the hardware description language and proceeding up to configuration file generation.
The present work is implemented with the 3D PRO tool. The study is organized as follows. Section 2 presents the clock tree optimization and clock selectivity procedures. Section 3 presents the thermal profile. Section 4 presents the analysis and evaluation models for stochastic simulation and discusses the results. Section 5 concludes the study.

CLOCK TREE OPTIMIZATION AND CLOCK SELECTIVITY
The basic task of Clock Tree Synthesis (CTS) is to develop the interconnect that connects the system clock to all the cells in the chip that use the clock (Markov and Lee, 2011). For CTS, the major concerns are:
• Minimizing the clock skew
• Optimizing clock buffers to meet skew
The primary job of CTS tools is to vary routing paths, the placement of the clocked cells and the clock buffers to meet maximum skew specifications. For a balanced tree without buffers (before CTS), the clock line's capacitance increases exponentially as you move from the clocked element to the primary clock input. The extra capacitance results from the wider metal needed to carry current to the branching segments. The extra metal also results in additional chip area to accommodate the extra clock-line width. Adding buffers at the branching points of the tree significantly lowers clock-interconnect capacitance, because you can reduce clock-line width towards the root.
When designing a clock tree, one needs to consider timing-related performance specifications. Clock-tree timing specifications include clock latency, skew and jitter. Non-timing specifications include power dissipation and signal integrity. Many clock-design issues affect multiple performance parameters; for example, adding clock buffers to balance clock lines and decrease skew may result in additional clock-tree power dissipation. The biggest problem in designing clock trees is skew minimization. The factors that contribute to clock skew include loading mismatch at the clocked elements and mismatch in RC delay.
Clock skew adds to cycle time, reducing the clock rate at which a chip can operate. Typically, skew should be 10% or less of a chip's clock cycle, meaning that for a 100 MHz clock (10 ns period), skew must be 1 ns or less. High-performance designs may require skew to be 5% of the clock cycle.
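As a small worked example of the rule of thumb above, the skew budget can be computed from the clock frequency and the allowed fraction of the period. The function name and its parameters are illustrative assumptions, not part of the original flow.

```python
def skew_budget_seconds(clock_hz: float, fraction: float = 0.10) -> float:
    """Maximum allowed skew, taken as a fraction of the clock period."""
    if clock_hz <= 0:
        raise ValueError("clock frequency must be positive")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must lie in (0, 1]")
    period = 1.0 / clock_hz          # clock period in seconds
    return fraction * period


# Budgets for the cases mentioned in the text (10% and 5% of the cycle).
for f_mhz, frac in [(100, 0.10), (100, 0.05), (300, 0.10)]:
    budget_ns = skew_budget_seconds(f_mhz * 1e6, frac) * 1e9
    print(f"{f_mhz} MHz, {frac * 100:.0f}% of cycle: {budget_ns:.3f} ns")
```

For the 100 MHz, 10% case this reproduces the 1 ns limit stated in the text.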

Clock Design Methodology
Most clock-network design strategies depend on the design at hand. This method uses the Clock Generator tool along with Cadence place-and-route tools. This tool combination produces a tree with minimum insertion delay, a minimum number of buffers and maximum fan-out. Here the skew is less than 300 ps. After generation of the clock tree, the output from the place-and-route tool is flat, meaning that the design hierarchy is lost.
Effects of CTS:
• Many clock buffers are added
• Congestion may increase
• Non-clock-tree cells may have been moved to non-ideal locations
• New timing violations can be introduced
The block diagram of the proposed All Digital Multiphase DLL (ADMDLL) is depicted in Fig. 1. It consists of five major blocks: the Digitally Controlled Delay Line (DCDL), the Frequency-Estimation Selector (FES), the Narrow-Wide Coarse Lock Detector (NWCLD), the Time to Digital Converter (TDC) and the edge combiner. The key advantages of these blocks are as follows: the TDC provides good time resolution and stability, and the NWCLD together with the FES monitors mismatches among the harmonics of the clock signal, which results in harmonic-free and fast locking behavior of the DLL. The DCDL is a digital circuit whose delay is set by digital control signals. It has both coarse and fine delay units in order to provide an effective delay of the clock signal according to the NWCLD outputs. A wide operating frequency range of the DLL is achieved by using the FES, which reuses the delay lines and is thereby able to reduce circuit area and power dissipation.
First, the NWCLD activates the FES and aligns the DCDL to the corresponding delay range. Then, the FES takes the multiphase outputs of the DCDL to estimate the input frequency range and generates the digital codes (F1, F0). The TDC shown in Fig. 1 converts the reference clock's period information into multiples of the Range Delay Unit (RDU) delay time. After the TDC encoder, the DCDL range selection control code is sent to the NWCLD. The NWCLD then uses (F1, F0) and the signals from the PFD to carry out the locking steps of the digital multiphase DLL. Once the DCDL clock signal is aligned with the reference clock signal, the ADMDLL achieves the lock state. After frequency acquisition between the input clock and the delayed clock in the NWCLD using the DCDL multi-phases, one-cycle phase lock occurs in the PFD. Using the NWCLD not only makes the best use of the DCDL range but also makes the DLL immune to SSN, without harmonic lock and stuck problems.

Clock Selectivity
The Clock Selectivity algorithm determines the delay, interconnect and power information at each point (Jeng, 2004) in a bottom-up approach. Buffering and wire sizing are then determined by a top-down approach. This is useful for modeling the constraints over the simulation and makes timing problems easier to handle (Markov and Lee, 2011). The approach guarantees the optimal solution for each subtree and propagates it towards the root node of the tree, but it does not guarantee the local adjustment.
The problem is defined as minimal delay under interconnect uncertainty:
• Find wire widths with bounded delay and power under interconnect uncertainty such that timing margins are reduced and power can be estimated within the range
• Wire sizing is an effective way of reducing the clock delay and of applying the stochastic flows over the non-Gaussian SSTA
Fig. 2 shows the 3D MEANDER framework. Interconnect and gate delays are taken into account by applying the Clock Selectivity algorithm to the model, with a resistance-capacitance model for the interconnect uncertainty. Using buffers and the Elmore delay model for delay calculation reduces the complexity of the algorithm (Thoziyoor et al., 2008).
The design space is defined by the wire length, width, resistance and capacitance. The wire capacitance is assumed to be split equally and attached at both ends of the wire. The sources have a delay that models the intrinsic delay, as the number of sets is highly dependent. This dependency arises from the temporal locality and periodicity of the currents across different clock cycles in the design space. In this space, there exists at least one pair of points of interest for the delay modeling.
In the first continuity phase, the region of all nodes is isolated and formulated as a work space region, and its dimension is determined for the given routing node, as follows. Given a set of target delays and capacitive loads, the spanning tree points can be captured. In this work, one approach identifies all the nodes and constraints from the root node to the leaf nodes. The top-down or bottom-up approach can be investigated through the sample points in this space: it estimates the region of nodes and analytically evaluates the distance to optimality over the clock tree by adjusting the number of samples on delay, wire sizing and interconnect variation. Here the non-Gaussian SSTA is applied for accurate and deterministic routing.
The next approach is to measure the clock tree's jitter directly using Evaluation Board (EVB) measurements. SMA cables connect the output of the XO/clock generator EVB to the input of the clock buffer EVB. Each component in the clock tree adds phase jitter to the starting reference clock. A crucial step is the timing component selection process, which selects the particular devices that meet the system specification with the lowest jitter.
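The Elmore delay model with the wire capacitance split equally between the two ends of each wire can be sketched as below. This is a minimal illustration under stated assumptions, not the tool's implementation: the `Node` class, the function names and the example values are hypothetical, and the driver is modeled as a single resistance.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Clock-tree node; wire_r and wire_c describe the wire from its parent."""
    name: str
    wire_r: float = 0.0   # total wire resistance to the parent (ohm)
    wire_c: float = 0.0   # total wire capacitance to the parent (farad)
    load_c: float = 0.0   # sink pin capacitance at this node (farad)
    children: list = field(default_factory=list)


def subtree_cap(node):
    """Capacitance downstream of the wire resistance that feeds `node`."""
    total = node.load_c + node.wire_c / 2.0        # far-end half of own wire
    for child in node.children:
        # near-end half of the child's wire plus everything below the child
        total += child.wire_c / 2.0 + subtree_cap(child)
    return total


def elmore_delays(root, driver_r=0.0):
    """Return {sink name: Elmore delay in seconds} for every leaf node."""
    delays = {}

    def walk(node, t_parent):
        t_here = t_parent + node.wire_r * subtree_cap(node)
        if not node.children:
            delays[node.name] = t_here
        for child in node.children:
            walk(child, t_here)

    walk(root, driver_r * subtree_cap(root))
    return delays


# Hypothetical two-sink tree: the skew is the spread of the sink delays.
example_root = Node("root", children=[
    Node("a", wire_r=100.0, wire_c=2e-12, load_c=1e-12),
    Node("b", wire_r=200.0, wire_c=4e-12, load_c=1e-12),
])
example_delays = elmore_delays(example_root)
example_skew = max(example_delays.values()) - min(example_delays.values())
print(example_delays, example_skew)
```

The recursion recomputes subtree capacitances for clarity; a bottom-up pass that caches them, as in the bottom-up step described above, would avoid the repeated work on large trees.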
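The accumulation of jitter along the clock tree can be estimated with a root-sum-square combination. This sketch assumes that the jitter contributions of the stages are random and mutually uncorrelated, which the source does not state; the stage values are hypothetical and the function name is illustrative.

```python
import math


def total_rms_jitter(stage_jitters_ps):
    """Combine uncorrelated RMS jitter contributions (in ps) by root-sum-square."""
    if any(j < 0 for j in stage_jitters_ps):
        raise ValueError("RMS jitter values must be non-negative")
    return math.sqrt(sum(j * j for j in stage_jitters_ps))


# Hypothetical chain: reference oscillator, clock generator, clock buffer.
chain_ps = [0.3, 0.4, 1.2]
print(f"total RMS jitter: {total_rms_jitter(chain_ps):.3f} ps")
```

Because the terms add in quadrature, the stage with the largest jitter dominates the total, which is why component selection matters most for the noisiest device in the chain.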

THERMAL PROFILE
Consider a clock tree with n nodes, p samples for delay and q samples for wire width. The complexity arises with a non-uniform thermal profile. The thermally induced skew for the initial clock tree is obtained first. The minimal change in wire widths and samples due to the partitioning is then analyzed (Minsik et al., 2005). The maximum and minimum values of the capacitive load for each delay sample are calculated and rounded to the nearest value for storage. The initial routings are generated using the DME algorithm. Next, a top-down approach is used to bring the power down to a very low level. At a routing node where a branch is created, samples on two further variables, including the width size, are easily eliminated (Basir-Kazeruni et al., 2013). Hence the storage required is lowered at every iteration for the samples that lie inside the regions.
The thermally increased resistance produces a small drift of the hot spot (value) to a new point. The thermally induced value is proportional to an equal (simultaneous) drift in point location. This is due to the simultaneous switching noise that arises from switching activity on any device interface. In addition, some points are traced in the top-down approach and the merging points of the clock tree are found. The clock tree is then further reduced without modifying the wire length and routing under worst-case conditions. The procedure for the device netlist format and clock latency is as follows:
After the place-and-route (P and R) step, the Agility Compiler from Celoxica is applied. It provides a synthesis path from SystemC code and outputs Electronic Design Interchange Format (EDIF) netlists for FPGA logic and RTL HDL code for ASIC synthesis. The implemented optimizations include automatic tree balancing, fine-grained logic sharing and retiming.
Another tool, PICO Express from Synfora, synthesizes ANSI C into RTL. In the ASIC design flow it takes the C program code, a data set and system requirements such as the latency between clocking areas.
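The source does not give the merging rule used in the DME step or its thermal model. The sketch below therefore assumes the textbook zero-skew merge under the Elmore delay model and a linear temperature coefficient of resistance; all function names, the coefficient value (roughly that of copper) and the example numbers are assumptions. It places the tapping point at a nominal temperature and then evaluates the skew that appears when one branch runs hotter.

```python
def branch_delay(t_sub, c_sub, wire_len, r_unit, c_unit):
    """Elmore delay through a wire of length wire_len into a subtree."""
    return t_sub + r_unit * wire_len * (c_unit * wire_len / 2.0 + c_sub)


def zero_skew_merge(t1, c1, t2, c2, length, r_unit, c_unit):
    """Fraction x of the wire (measured from subtree 1) that equalizes delays."""
    if length <= 0:
        raise ValueError("wire length must be positive")
    num = (t2 - t1) + r_unit * length * (c2 + c_unit * length / 2.0)
    den = r_unit * length * (c_unit * length + c1 + c2)
    x = num / den
    if not 0.0 <= x <= 1.0:
        raise ValueError("balancing needs wire elongation; not handled here")
    return x


def thermal_r(r_unit, temp_c, ref_temp_c=25.0, tcr=0.0039):
    """Assumed linear model: R(T) = R0 * (1 + tcr * (T - T_ref))."""
    return r_unit * (1.0 + tcr * (temp_c - ref_temp_c))


def thermal_skew(x, t1, c1, t2, c2, length, r_unit, c_unit, temp1_c, temp2_c):
    """Delay difference (branch 1 minus branch 2) with each branch at its own
    temperature; subtree delays t1, t2 are kept fixed for simplicity."""
    d1 = branch_delay(t1, c1, x * length, thermal_r(r_unit, temp1_c), c_unit)
    d2 = branch_delay(t2, c2, (1.0 - x) * length, thermal_r(r_unit, temp2_c), c_unit)
    return d1 - d2


# Hypothetical values: lengths in um, ohm/um, F/um, seconds, farads.
params = dict(t1=0.0, c1=50e-15, t2=5e-12, c2=30e-15,
              length=1000.0, r_unit=0.1, c_unit=0.2e-15)
x_nominal = zero_skew_merge(**params)
print("tapping fraction:", x_nominal)
print("skew at 25/25 C :", thermal_skew(x_nominal, temp1_c=25.0, temp2_c=25.0, **params))
print("skew at 85/25 C :", thermal_skew(x_nominal, temp1_c=85.0, temp2_c=25.0, **params))
```

A tapping point chosen for a uniform profile no longer balances the tree once one branch heats up, which is the thermally induced skew that the top-down adjustment is meant to reduce.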

MODELS FOR STOCHASTIC SIMULATION
Several techniques are available for clock tree optimization (Cong et al., 1997), especially buffer insertion and wire sizing. There have been efforts to reduce the size of the wiring. The probabilistic behavior of real-time input models varies with the experimental approach. Very few techniques avoid modifying the existing routing. In the experimental investigation, one distribution family is first applied to model a simulation process; subsequently, other classes of models are used for simulation. Extensive work on stochastic simulation is reported in this study.
There are a variety of reasons to analyze the simulation models. Most often the aim is to add some functionality to each simulation approach. The enhanced platform can simultaneously optimize delay, power and area with very low skew and sensitivity (Mueller and Whalley, 1995).
Fig. 3 shows the block diagram of the test bench setup. The board (ML505) routes some pins of the FPGA, those routed to clock I/O, to small connectors. 50-ohm coaxial cables are used to clock the FPGA and to analyze its output (Aloisio et al., 2010; Chen et al., 1996; ITRS, 2003; Data, 2006). In addition, a clock generator with a very small random jitter of approximately 4.5 ps drives the board. Finally, a 50-ohm resistor terminates the clock line.
Interconnect delays can be calculated using the Elmore delay model and can also be expressed as a function of the process parameters. Consider a chip area divided into grids such that the parameter variations within a grid are the same. Different grids exhibit spatial correlations.
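The grid model above can be sketched by giving each grid cell one parameter deviation and correlating the cells according to their distance. The exponential decay of correlation with distance, the correlation length, the grid size and all names below are assumptions made for illustration; the source does not specify its correlation function.

```python
import math
import random


def grid_centers(rows, cols, pitch_mm):
    """Center coordinates (mm) of each grid cell, row by row."""
    return [((c + 0.5) * pitch_mm, (r + 0.5) * pitch_mm)
            for r in range(rows) for c in range(cols)]


def correlation_matrix(centers, corr_length_mm):
    """Assumed model: correlation decays as exp(-distance / corr_length)."""
    n = len(centers)
    return [[math.exp(-math.dist(centers[i], centers[j]) / corr_length_mm)
             for j in range(n)] for i in range(n)]


def cholesky(a):
    """Lower-triangular factor L with L * L^T = a (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_grid_deviations(low, sigma, rng):
    """One correlated draw: a deviation per grid cell with std. deviation sigma."""
    z = [rng.gauss(0.0, 1.0) for _ in low]
    return [sigma * sum(low[i][k] * z[k] for k in range(i + 1))
            for i in range(len(low))]


# Hypothetical 2 x 2 grid with 1 mm pitch and 1 mm correlation length.
centers = grid_centers(2, 2, pitch_mm=1.0)
factor = cholesky(correlation_matrix(centers, corr_length_mm=1.0))
print(sample_grid_deviations(factor, sigma=0.05, rng=random.Random(1)))
```

Every gate or wire inside a cell would share that cell's deviation, so neighboring cells receive similar values and distant cells become nearly independent.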

AJAS
Similarly, the gate delay can be approximated as a function of the process parameters. The average and maximum functions are expressed as linear functions of the principal components. Thus all path delays derived in the statistical timing graph also become linear functions of the principal components. In this experiment the parameters are set and both the spatial and the random variable components are accounted for. The computation requires the locations of gates and interconnects. In the modeled circuits, undefined and sink nodes are ignored in the former path. The steps involved in the spatial distribution of parameters tend to increase power/energy consumption as well as the variation of on-chip temperature (Chen et al., 1996). This work shows a significant reduction with respect to 2D architectures (Mueller and Whalley, 1995). Moreover, the 3D architecture is emphasized for mapping applications onto 3D reconfigurable devices from HDL to configuration file generation.
SSTA maps the standard deviations with respect to the independent process and interconnect parameters to obtain the overall standard deviation of the path delay, since SSTA needs to perform statistical yield analysis based on the process parameter statistics. A physical process parameter such as the channel length, oxide thickness or threshold voltage can be expressed as the sum of a nominal value and deviations that are global (die-to-die or inter-die) as well as local (within-die or intra-die).
The MS4 Modeling Environment is a general-purpose, DEVS-methodology-based software environment for discrete event and hybrid models. The combination of the DEVS and System Entity Structure (SES) frameworks allows prototyping of families of models as well as System of Systems Engineering (SoSE) modeling and simulation.
These errors occur in the bins whose limits are defined by taps at opposite extremes of consecutive timing DLLs. They result from phase errors and from delay cell mismatch in the timing and phase-shifting DLLs. Finally, by analyzing the FFT plot of the TDC input clock signal in terms of phase noise and offset frequency and applying the Bezier distribution, the jitter level can be easily measured, as shown in Fig. 4.
Apart from these techniques that rely on the Bezier distribution, a few further steps are reported in the simulations of Fig. 4 and 5, which identify the undefined (or unnecessary) routing and sink nodes that have a greater impact on interconnectivity.
The advantages of this approach are faster simulation in the presence of defects or delay variation, sampling within their limits and accuracy in estimating the parameters. The correlation covariance decreases further to 0.1 at a distance of 2 mm. A yield improvement is obtained for both the deterministic and stochastic flows (Chang and Sapatnekar, 2003).
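The decomposition into global and local deviations can be illustrated with a first-order estimate of the path delay standard deviation. This sketch assumes a linear delay model, one global deviation shared by all gates on the path and local deviations that are independent from gate to gate (spatial correlation is ignored here); the names and numbers are hypothetical.

```python
import math


def path_delay_sigma(gate_sensitivities, sigma_global, sigma_local):
    """First-order standard deviation of a path delay.

    gate_sensitivities: delay change of each gate per unit parameter deviation.
    The shared global deviation adds linearly across gates; the independent
    local deviations add in quadrature.
    """
    shared = sum(gate_sensitivities) * sigma_global
    independent = math.sqrt(sum((s * sigma_local) ** 2 for s in gate_sensitivities))
    return math.hypot(shared, independent)


# Hypothetical four-gate path, sensitivities in ps per unit deviation.
sens = [1.0, 1.0, 1.0, 1.0]
print("global only:", path_delay_sigma(sens, sigma_global=1.0, sigma_local=0.0))
print("local only :", path_delay_sigma(sens, sigma_global=0.0, sigma_local=1.0))
print("both       :", path_delay_sigma(sens, sigma_global=1.0, sigma_local=1.0))
```

With equal sensitivities the global term grows with the number of gates while the local term grows only with its square root, which is why inter-die variation tends to dominate long paths under these assumptions.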

Bezier Distribution
In this component model, a Bezier distribution is used to approximate a function on a bounded interval by passing a curve through the neighborhood of the control points. Consider a two-dimensional Euclidean space. In this work, a Bezier curve of degree $n$ with routing nodes (control points) $V_0, V_1, V_2, \ldots, V_n$ is given parametrically by Equation (1). If $X$ is a continuous random variable whose space is the bounded interval $[a, b]$ and if $X$ has c.d.f. $F_X(\cdot)$ and p.d.f. $f_X(\cdot)$, then in principle we can approximate $F_X(\cdot)$ arbitrarily closely using a Bezier curve of the form (1) by taking a sufficient number $(n+1)$ of control points with appropriate values for the coordinates $(x_i, z_i)^T$ of the points $V_i$ for $i = 0, \ldots, n$. If $X$ is a Bezier random variable, then the c.d.f. of $X$ is given parametrically by Equation (3), with the coordinate functions given by Equation (4). This representation reveals that the control points $V_0, V_1, V_2, \ldots, V_n$ constitute the parameters regulating the properties of a Bezier distribution. Thus the control points must be arranged so as to ensure the basic requirements of a c.d.f. By utilizing the Bezier property that the curve described by (3) and (4) passes through the control points $V_0$ and $V_n$ exactly, we also ensure that $F_X(a) = 0$ if we take $V_0 \equiv (a, 0)^T$ and $F_X(b) = 1$ if we take $V_n \equiv (b, 1)^T$. See Wagner and Wilson (1993) for a complete discussion of univariate Bezier distributions and their use in simulation input modeling.
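For reference, the equations cited in this paragraph are not reproduced in the text. Assuming the standard formulation of Wagner and Wilson (1993), which this section appears to follow, the Bezier curve, its Bernstein blending polynomials and the parametric c.d.f. would read as below. The correspondence with the equation numbers (1) to (4) used in the text is an assumption.

```latex
\begin{align}
P(t) &= \sum_{i=0}^{n} B_{n,i}(t)\, V_i, \qquad t \in [0, 1], \\
B_{n,i}(t) &= \frac{n!}{i!\,(n-i)!}\; t^{i} (1 - t)^{n-i}, \\
x(t) &= \sum_{i=0}^{n} B_{n,i}(t)\, x_i, \qquad
F_X\bigl(x(t)\bigr) = \sum_{i=0}^{n} B_{n,i}(t)\, z_i .
\end{align}
```

Since $B_{n,0}(0) = 1$ and $B_{n,n}(1) = 1$ while all other blending polynomials vanish at those endpoints, the curve passes through $V_0$ at $t = 0$ and through $V_n$ at $t = 1$, which is the endpoint property used above to fix $F_X(a) = 0$ and $F_X(b) = 1$.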
If $X$ is a Bezier random variable with c.d.f. $F_X(\cdot)$ given parametrically by (3), then the corresponding p.d.f. $f_X(x)$ for all real $x$ is also given parametrically, with $x(t)$ given by (4). In the expression for the p.d.f., $\Delta x_i = x_{i+1} - x_i$ and $\Delta z_i = z_{i+1} - z_i$ (for $i = 0, 1, \ldots, n-1$) represent the corresponding first differences of the x- and z-coordinates of the original control points $\{V_0, V_1, \ldots, V_n\}$ in the parametric representation (3) of the c.d.f.
Table 1 and 2 present the clock selectivity and thermal profile results with reference to the routing node.
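The p.d.f. expression for the Bezier random variable is not reproduced in the text. Under the assumption that the c.d.f. has the standard parametric form with Bernstein polynomials $B_{n,i}(t) = \binom{n}{i} t^{i} (1-t)^{n-i}$, it follows from the chain rule and the derivative of a Bezier curve, as sketched below.

```latex
\begin{align}
\frac{dx}{dt} &= n \sum_{i=0}^{n-1} B_{n-1,i}(t)\, \Delta x_i, \qquad
\frac{dF_X}{dt} = n \sum_{i=0}^{n-1} B_{n-1,i}(t)\, \Delta z_i, \\
f_X\bigl(x(t)\bigr) &= \frac{dF_X / dt}{dx / dt}
  = \frac{\sum_{i=0}^{n-1} B_{n-1,i}(t)\, \Delta z_i}
         {\sum_{i=0}^{n-1} B_{n-1,i}(t)\, \Delta x_i},
\qquad \Delta x_i = x_{i+1} - x_i, \quad \Delta z_i = z_{i+1} - z_i .
\end{align}
```

The factor $n$ cancels in the ratio, so the density depends only on the first differences of the control-point coordinates; it is non-negative whenever both the $x_i$ and the $z_i$ are non-decreasing.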

Bezier Inversion
The method of inversion is used to generate a Bezier random variable whose c.d.f. has the parametric representation in (3) and (4). Given a random number $U \sim \mathrm{Uniform}[0, 1]$, perform the following steps: find $t_U \in [0, 1]$ that satisfies Equation (5), then deliver the variate given by Equation (6). The solution of (5) can be computed by any root-finding algorithm such as Muller's method, Newton's method or the bisection method. Codes to implement this approach are available.
Note: As documented in the study on multivariate input modeling (Khul, 2010), the inversion scheme in Equations (5) and (6) for generating Bezier random variables will be a key element in building multivariate extensions of the univariate Bezier distributions as well as stationary univariate time series whose marginals are Bezier distributions.
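The inversion steps can be sketched as follows, using bisection as the root finder. The sketch assumes the standard parametric form in which the c.d.f. value at parameter t is the Bernstein-weighted sum of the z-coordinates and the variate is the Bernstein-weighted sum of the x-coordinates; the function names and control points are hypothetical, and the z-coordinates are assumed non-decreasing from 0 to 1.

```python
import math
import random


def bernstein(n, i, t):
    """Bernstein polynomial B_{n,i}(t)."""
    return math.comb(n, i) * t ** i * (1.0 - t) ** (n - i)


def bezier_coord(coords, t):
    """Bernstein-weighted sum of one coordinate of the control points."""
    n = len(coords) - 1
    return sum(bernstein(n, i, t) * c for i, c in enumerate(coords))


def sample_bezier(xs, zs, u, tol=1e-12):
    """Invert the parametric c.d.f.: find t_U with z(t_U) = u, return x(t_U)."""
    if len(xs) != len(zs) or len(xs) < 2:
        raise ValueError("need matching x and z coordinates of >= 2 points")
    if not 0.0 <= u <= 1.0:
        raise ValueError("u must lie in [0, 1]")
    lo, hi = 0.0, 1.0
    while hi - lo > tol:                 # bisection on the monotone z(t)
        mid = 0.5 * (lo + hi)
        if bezier_coord(zs, mid) < u:
            lo = mid
        else:
            hi = mid
    return bezier_coord(xs, 0.5 * (lo + hi))


# Hypothetical control points on [a, b] = [2, 5]: V_0 = (2, 0), V_n = (5, 1).
xs = [2.0, 3.0, 5.0]
zs = [0.0, 0.5, 1.0]
rng = random.Random(42)
draws = [sample_bezier(xs, zs, rng.random()) for _ in range(5)]
print(draws)
```

Bisection is chosen here only because it needs no derivative and cannot leave the interval [0, 1]; Newton's or Muller's method, also named in the text, would converge in fewer iterations.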

Approach Related to C/C++ /CACTI & HDL
CACTI-D includes modeling of circuits, area, delay, dynamic energy and leakage power. In this work, the results shown in Fig. 6 and 7 emphasize the modeling based on the Bezier distribution under PVT variations and the interconnect technology modeling. Applying the analytical gate area model (Thoziyoor et al., 2008) drives the transistor sizing and leads to a very small gate area. The advantage of the method is that it handles different pitch-matching constraints and the error raised by directionless routing in the bitline interconnect. Table 1 and 2 give the parametric constraints over the routing nodes. From the tabulated data, the spatially correlated variables behave better than the local variables under process variation. To validate the data, a 65 nm 16 MB Intel Xeon L3 cache (Chang et al., 2007) and a 90 nm 4 MB Sun SPARC L2 cache (McIntyre et al., 2005) are compared, with one floating-point arithmetic instruction executed per cycle. The clock latency is also found. Moreover, when the optimization variables are varied, an average error of more than 20%, up to 23.5%, in access time and area is obtained.

Results and Experimental Validation
The runtime achieved in this work is driven by the path to the sink node, where the associated weights are proportional to the signal delay. The algorithm runs with a 64 s runtime and 1.3 Mb of storage, at 2.3 ps. The results are presented in Fig. 10a and b. The skew value for the clock tree generated by "CS" in the presence of thermal profiles yields savings of about 56% (Akhilesh and Anis, 2010; ITRS, 2003). The results in Fig. 8 and 9 confirm the skew variations and delay in the selected clock tree. Runtime and storage are also taken into account for the wire segments, and the access time is achieved extensively (Yang, 1991).
A further iteration results in a smaller penalty (Aloisio et al., 2010; Mueller and Whalley, 1995), which is also a reason for the decreased quality in clock skew. This result gives a clear picture of fast access with measured and simulated techniques. The approach can be used for memory hierarchy exploration in embedded systems.

CONCLUSION
In this study, stochastic algorithms are worked out within high-level synthesis algorithms, and mapping is carried out for their stochastic flows under process variation. The simulation models rely on spatial fields, in which the coordinates of nodes are selected and the length and wire sizing are incorporated for deterministic routing. The overall run time is reduced. The routing stage yields less improvement in timing, so SSTA analysis is not reliable under all conditions for evaluation. The term obtained in the experimental setup is "directionless routing and sink node"; it has been modified to extend the impact on timing-driven algorithms. Hence the timing is improved: the mean delay goes from 6.2 to 5.2% and the standard deviation from 7.5 to 7.6%. The overall performance measure is therefore >3.0. Table 1 and 2 show the improvement over the algorithms and the interconnect uncertainty; the total wire length is reduced for deterministic routing.
Furthermore, the results obtained from this approach distribute the power noise and also effectively reduce the noise. The two approaches initiate and model the interconnect uncertainty and are extended to statistical optimization. The data bit streams of both the original and the captured signals are stored in the FPGA chip.
This technique is used in high-energy and nuclear physics experiments, where the signals are picked up and transferred into simulation mode (pC). Further processing and interfacing with the front-end data acquisition system are left for a future study.