Design of QoS Aware Dominating set based Semantic Overlay Network (QADSON) for Content Distribution

Problem statement: This study addressed the user's perceived quality of service requirements in content distribution and investigated the role of the QoS Aware Dominating set based Semantic Overlay Network (QADSON) in surrogate server selection to achieve the specified quality of service. Approach: At first, we constructed the QoS aware dominating set based semantic overlay network, a virtual network of surrogate servers built on top of the existing physical network with the purpose of implementing new network services and features such as efficiency, exact-1 domination, controlled redundancy and fault tolerance that are not available in the existing network. We applied the EFRRA content replication algorithm to disseminate the content among the surrogate servers and evaluated its performance in QADSON. Results: We assessed the efficiency, exact-1 domination, controlled redundancy and fault resiliency of QADSON in terms of mean response time, mean CDN utility, hit ratio percentage, rejection rate and CDN load. We extended the simulation experiments to analyze the role of QADSON in maintaining a uniform CDN utility of above 0.95. Conclusion: We also investigated the quality of service requirements for content distribution and evaluated the performance of the QADSON based CDN in terms of mean response time, latency, hit ratio percentage, mean CDN utility, rejection rate and CDN load.


INTRODUCTION
A Content Distribution Network (CDN) consists of many surrogate servers located at different locations which can be clustered or grouped together to form a surrogate server site, so that a client has a good connectivity to at least one of the surrogate servers. These surrogate servers have to cooperate with each other to enhance the performance of the content delivery network and meet the user perceived Quality of Service (QoS). A CDN provider mainly focuses on providing the following services and functionalities: storage and management of content, distribution of content among surrogate servers, cache management, delivery of static, dynamic and streaming content, back up and disaster recovery solutions, monitoring services, performance monitoring and reporting.
This paper dealt with the implementation of a Quality of Service aware dominating set for selecting the replication set of surrogate servers. The EFRRA content replication algorithm is then applied to disseminate the content among the surrogate servers in the QADSON based CDN. Performance measurement is carried out to estimate the values of performance indices such as size of CDN, mean response time, mean CDN utility, latency, hit ratio percentage, number of completed requests, rejection rate and CDN load. These indices give an indication of system conditions and are used to identify the factors that influence the design of a CDN and its performance, assisting content providers in decision making and in achieving efficiency, load balancing and fault tolerance in massive content distribution systems.
Pathan and Buyya (2008) presented a comprehensive taxonomy with a broad coverage of CDNs in terms of organizational structure, content distribution mechanisms, request redirection techniques and performance measurement methodologies. Their survey focused on understanding the existing CDNs in terms of their infrastructure, request-routing mechanisms, content replication techniques, load balancing and cache management. Rodriguez and Biersack (2002) proposed a dynamic parallel-access scheme to access multiple mirror servers. They showed that their dynamic parallel downloading scheme achieves significant downloading speedup with respect to a single server scheme. However, they studied only the scenario where one client uses parallel downloading. The effect and consequences when all clients choose to adopt the same scheme are not addressed. Cherksova and Kee (2002) proposed the Fast Replica algorithm to distribute the content, in which a user downloads different parts of the same file from different servers in parallel. Once all the parts of the file are received, the user reconstructs the original file by reassembling the different parts. Lu et al. (2008) proposed a novel content push policy called TRRR, i.e., Tree-Round-Robin-Replica, which yields an efficient and reliable solution for distributing large files in the content delivery network environment. They carried out small-scale simulation experiments to verify the TRRR algorithm and demonstrated that TRRR significantly reduces the file replication time as compared with traditional policies such as sequential unicast and multiple unicast.
Dominating sets have been used by Han and Jia (2005) in topology control for wireless ad hoc networks. Ma et al. (2005) used dominating sets for virtual backbone creation in sensor networks. Shakkottai and Johari (2010) proposed a hybrid content distribution system that combines the features of peer-to-peer and centralized client-server content distribution systems. Xia et al. (2009) considered a two-tier content distribution system for distributing massive content and proposed popularity-based file replication techniques within the CDN using multiple hash functions. Ozkasap et al. (2009) proposed and designed a peer-to-peer system, SeCond, addressing the distribution of large sized content to a large number of end systems in an efficient manner. It employed a self-organizing epidemic dissemination scheme for state propagation of available blocks and initiation of block transmissions. They showed that SeCond is a scalable and adaptive protocol which takes the heterogeneity of the peers into account. Stamos et al. (2008) presented a generic non-parametric heuristic method that integrates caching and content replication to improve the performance of a CDN in terms of availability and cache hit ratio. They developed a placement similarity approach, called SRC, for evaluating the level of integration. Byers et al. (1999) proposed a parallel accessing scheme based on tornado codes in which a client is allowed to access a file from multiple mirror sites in parallel to speed up the download. Pathan and Buyya (2009) presented an architecture to support peering arrangements between CDNs, based on a Virtual Organization (VO) model. Performance can be achieved through proper policy management of negotiated Service Level Agreements (SLAs) between peers. They also presented a Quality of Service (QoS)-driven performance modeling approach for peering CDNs in order to predict the user perceived performance.
Their approach has provisions for an overloaded CDN to return to a normal state by offloading excess requests to its peers, while also providing a concrete QoS guarantee for a CDN provider. Geetha and Narayanan (2011) presented a survey on current trends and methods in video retrieval. The major themes covered by their study include shot segmentation, key frame extraction, feature extraction, clustering, indexing and video retrieval by similarity, probabilistic, transformational, refinement and relevance feedback. This study assists upcoming researchers in the field of video retrieval and helps them learn about the techniques and methods available for video retrieval. Nandagopal and Uthariaraj (2011) proposed the Multi Criteria Resource Selection (MCRS) algorithm, which considered multiple criteria such as processing power, workload and network bandwidth of the resource during resource selection. They conducted simulation experiments to evaluate the performance of the algorithm and compared its performance with a conventional single criterion algorithm. Ramadoss and Rajkumar (2007) considered a system for the semi-automatic annotation of audiovisual media of the dance domain, DMAR (Dance Media Annotation, authoring and Retrieval system). Their work outlined the underlying XML Schema based content description structures of DMAR and discussed the merits and demerits of their approach of evolving a semantic network as the basis for the audio-visual content description. Further, they proposed a quality metric, fidelity, to evaluate the expressive power of the dance annotations. Evaluation results are presented to depict the performance of the dance video queries in terms of precision and recall. Caviglione and Cervellera (2011) proposed an overlay Content Distribution Network (CDN) which is able to sustain the real-time delivery of data streams.
They modeled a predictive control scheme to enhance the utilization of resources and evaluated the effectiveness of the proposed solution for multimedia streaming and interactive grid data. Amutharaj and Radhakrishnan (2008) constructed a dominating set based overlay network to optimize the number of servers for replication. They investigated the use of the Fast Replica algorithm to reduce the content transfer time for replicating the content within the semantic overlay network and compared its performance with the sequential unicast and multiple unicast content distribution strategies in terms of content replication time and delivery ratio. A hybrid replication strategy named EFRRA, which combines the features of both the Fast Replica and tornado coding algorithms, has also been proposed. It provides an efficient and resilient content replication solution. Both an analytical study and an empirical study of EFRRA showed that the algorithm outperforms other algorithms in terms of replication time while maintaining a competitive delivery ratio.

MATERIALS AND METHODS
Design of QoS aware dominating set based semantic overlay network: A Semantic Overlay Network 'G' can be defined as $G = (V, E)$, where $V = \{V_1, V_2, V_3, \ldots, V_n\}$ is the set of surrogate servers and $E$ is the set of edges between the i-th surrogate server and the j-th surrogate server, i.e., $E = \{(V_i, V_j)\}$ such that $V_i \neq V_j$. Let $D \subset V$ be a dominating set of $G$: every server not in $D$ is adjacent to at least one surrogate server in $D$. Hence, every surrogate server is a member of either $D$ or $V \setminus D$. The QoS aware dominating set is formed with the aim of constructing a dominating set with properties such as efficiency, controlled redundancy and fault tolerance. Efficiency can be achieved by implementing the principle of dominating every vertex in the adjacent set at least once. The cardinality redundancy of a graph, $R(G)$, can be defined as the minimum number of vertices in the adjacent set dominated more than once by a dominating set. Controlled redundancy can be incorporated into the dominating set by keeping the cardinality redundancy value at its minimum. Fault tolerance is defined as the ability of the network to provide service even when it contains a faulty component or components. The fault tolerance property can be ensured by modeling the behavior of a network in the presence of a fault, and can be analyzed by determining the effect of removing an edge (link failure) or a vertex (server failure) from its underlying graph.
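The definitions above can be checked mechanically on a small graph. The following is a minimal sketch, assuming the overlay is stored as an adjacency dictionary; the function names and the five-server path graph are illustrative and not part of the original design.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # server -> set of adjacent servers


def domination_counts(graph: Graph, dominating: Set[Hashable]) -> Dict[Hashable, int]:
    """For every server in V \\ D, count how many members of D are adjacent to it."""
    return {v: len(graph[v] & dominating) for v in graph if v not in dominating}


def is_dominating(graph: Graph, dominating: Set[Hashable]) -> bool:
    """D dominates G when every server outside D has at least one neighbor in D."""
    return all(count >= 1 for count in domination_counts(graph, dominating).values())


def redundantly_dominated(graph: Graph, dominating: Set[Hashable]) -> int:
    """Number of servers in V \\ D that are dominated more than once by D."""
    return sum(1 for count in domination_counts(graph, dominating).values() if count > 1)


# Hypothetical overlay: five surrogate servers connected in a path 1-2-3-4-5.
overlay = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
```

On this path, both {2, 4} and {2, 5} are dominating sets, but {2, 4} dominates server 3 twice while {2, 5} dominates every outside server exactly once, which is the exact-1 behavior the QoS aware construction aims for.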

QoS aware dominating Set based SON (QADSON) construction algorithm:
Step 1: Mark all the vertices of the graph white.
Step 2: Select the vertex with the maximal number of white neighbors.
Step 3: The selected vertex is marked black and its neighbors are marked gray.
Step 4: The algorithm then iteratively scans the gray nodes and their white neighbors and selects the gray node or the pair of nodes (a gray node and one of its white neighbors), whichever has the maximal number of white neighbors.
Step 5: The selected node or the selected pair of nodes is marked black, with their white neighbors marked gray.
Step 6: Once all the vertices are marked gray or black, the algorithm terminates. All the black nodes form a Connected Dominating Set (CDS).
// QoS aware dominating set formation steps
Step 7: After forming the DS, check the degree of each vertex of the connected dominating set.
Step 8: If the degree of any vertex is more than one, mark that vertex gray and find a suitable alternate vertex as the member of the dominating set. // exact-1 domination
Step 9: Check whether the following criteria are satisfied by the DS:
• Redundance (control) criterion on D, where D is a dominating set
• Fault tolerance criterion: for any tree T in D with n ≥ 2, there exists a vertex v ∈ V such that γ(T − v) = γ(T)
The QoS aware dominating set formation algorithm is applied to form the semantic overlay network of surrogate servers, which are connected logically to provide the logical infrastructure of the CDN and in which any replication algorithm can replicate the content.
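Steps 1 to 6 follow the familiar white/gray/black greedy coloring for connected dominating sets. A minimal sketch of those six steps is given below, assuming a connected, undirected graph held as an adjacency dictionary; the function name and the tie-breaking order are assumptions, and the QoS refinement of Steps 7 to 9 is deliberately left out.

```python
from typing import Dict, Hashable, Set


def connected_dominating_set(graph: Dict[Hashable, Set[Hashable]]) -> Set[Hashable]:
    """Greedy coloring of Steps 1-6; returns the black nodes."""
    WHITE, GRAY, BLACK = "white", "gray", "black"
    if not graph:
        return set()
    color = {v: WHITE for v in graph}  # Step 1: everything starts white

    def white_neighbors(v):
        return {u for u in graph[v] if color[u] == WHITE}

    def blacken(v):
        # Steps 3 and 5: mark v black and its white neighbors gray.
        color[v] = BLACK
        for u in graph[v]:
            if color[u] == WHITE:
                color[u] = GRAY

    # Step 2: all nodes are white, so the vertex with most white neighbors
    # is simply the vertex of maximum degree.
    blacken(max(graph, key=lambda v: len(graph[v])))

    while any(c == WHITE for c in color.values()):
        best, best_gain = None, 0
        for g in [v for v in graph if color[v] == GRAY]:  # Step 4: scan gray nodes
            whites_of_g = white_neighbors(g)
            if len(whites_of_g) > best_gain:
                best, best_gain = (g,), len(whites_of_g)
            for w in whites_of_g:
                # Pair (gray g, white neighbor w): whites that would turn gray.
                gain = len((whites_of_g | white_neighbors(w)) - {w})
                if gain > best_gain:
                    best, best_gain = (g, w), gain
        if best is None:
            raise ValueError("graph is not connected")
        for v in best:
            blacken(v)

    return {v for v in graph if color[v] == BLACK}  # Step 6


# Hypothetical overlay: a path of five surrogate servers.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
backbone = connected_dominating_set(path)
```

Because every newly blackened node is gray (or a white neighbor of a gray node) at selection time, the black set stays connected throughout, which is why the result is a connected dominating set rather than just a dominating set.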

Design of an Efficient and Fault Resilient Replication Algorithm (EFRRA):
Working principle of EFRRA: A novel algorithm called EFRRA is proposed for an efficient and fault resilient replication of large files in the CDN. Working mechanism of EFRRA can be summarized as follows.
In order to replicate a large file among 'n' nodes, the original file is partitioned into 'n' sub files of equal size and the sub files are transferred to the 'n' different nodes in the group, one sub file per node. Each node is provided with a distribution list of 'm' surrogate servers. In the second round, each node propagates its sub file to the 'm' surrogate servers in the distribution list. This propagation is repeated until 'k' file fragments have reached all the 'n' recipient surrogate servers in the surrogate server site. Thus, instead of the typical replication of an entire file to 'n' nodes by using 'n' internet paths connecting the original node to the replication group, this replication algorithm exploits n*n diverse internet paths within the replication group, where each path is used for transferring 1/n-th of the file. Hence, the amount of data carried on each path is reduced to 1/n of the file size.
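The two rounds described above can be traced with a small in-memory simulation. This is only a sketch under simplifying assumptions: every distribution list contains all other nodes (m = n − 1), there are no failures, and the helper names `split_file`, `replicate` and `reassemble` are hypothetical.

```python
from typing import Dict, List


def split_file(data: bytes, n: int) -> List[bytes]:
    """Partition data into n sub files of (almost) equal size."""
    size = -(-len(data) // n)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n)]


def replicate(data: bytes, n: int) -> Dict[int, Dict[int, bytes]]:
    """Return, for each node, the sub files it holds after both rounds."""
    parts = split_file(data, n)
    # Round 1: the origin sends sub file i to node i over n concurrent connections.
    store = {node: {node: parts[node]} for node in range(n)}
    # Round 2: every node forwards its own sub file to the nodes in its
    # distribution list (assumed here to be all other nodes).
    for sender in range(n):
        for receiver in range(n):
            if receiver != sender:
                store[receiver][sender] = parts[sender]
    return store


def reassemble(fragments: Dict[int, bytes]) -> bytes:
    """Rebuild the original file from sub files keyed by their index."""
    return b"".join(fragments[i] for i in sorted(fragments))


original = bytes(range(10))
stores = replicate(original, 3)
```

Each transfer in either round carries only one sub file, so no single path ever moves more than roughly 1/n of the original file.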
Step 1: Distribution of content to surrogate servers: As shown in Fig. 1, the originator node $N_0$ opens n concurrent network connections to nodes $\{N_1, \ldots, N_n\}$ and sends to each recipient node $N_i$ ($1 \le i \le n$) the following:
• A distribution list of nodes $R = \{N_1, \ldots, N_m\}$ to which sub file $F_i$ has to be sent in the next step
• Sub file $F_i$
Step 2: Adding fault resiliency to EFRRA: This step keeps the main structure of the EFRRA replication algorithm practically unchanged while adding the desired property of resilience to node failure. To maintain resiliency, the surrogate servers in the network exchange heartbeat messages with their origin server. The heartbeat messages from the surrogate servers to their origin server are augmented with additional information on the corresponding algorithm. Once the content is distributed in the network, the receiver has to recollect all the content from the network in a parallel manner.
For example, if surrogate server $N_1$ fails during transfer, this may impact all surrogate servers $N_2, \ldots, N_n$ in the network, because each node depends on node $N_1$ to receive sub file $F_1$. In the scenario shown in Fig. 2, surrogate server $N_i$ is acting as a recipient server in the replication set. If a surrogate server fails while it acts as the origin surrogate server $N_i$, this failure impacts all the surrogate servers in the replication group, which may be the replication sub tree rooted at surrogate server $N_i$.
Step 3: Adding resiliency during content collection at the receiver: Once the entire file has been distributed to all the surrogate servers in the overlay network using Step 1, the recipient node or client node has to recollect all the sub files or blocks of the requested file from the overlay network of surrogate servers in a parallel manner. The original source file is transmitted by the sender as a sequence of 'k' encoded packets along with additional redundant packets, and the recipient node can use the redundant data to recover lost source data. Retransmission of lost packets is therefore not needed. In this collection step also, the EFRRA algorithm maintains resiliency against surrogate server failures and link outages.
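The idea of recovering lost source data from redundant packets, without retransmission, can be illustrated with a much simpler code than the tornado codes EFRRA builds on. The sketch below swaps in a single XOR parity packet, which tolerates the loss of exactly one packet; it assumes equal-length blocks, and the function names are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(blocks: List[bytes]) -> List[bytes]:
    """Append one redundant packet: the XOR of all k source blocks."""
    return list(blocks) + [reduce(xor_blocks, blocks)]


def recover(received: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild the k source blocks when at most one packet (None) was lost."""
    missing = [i for i, block in enumerate(received) if block is None]
    if len(missing) > 1:
        raise ValueError("a single parity packet can repair only one loss")
    packets = list(received)
    if missing:
        present = [block for block in packets if block is not None]
        # XOR of all surviving packets equals the lost one.
        packets[missing[0]] = reduce(xor_blocks, present)
    return packets[:-1]  # drop the parity packet


source = [b"abcd", b"efgh", b"ijkl"]
packets = encode(source)
packets[1] = None  # simulate a packet lost on a failed link
restored = recover(packets)
```

Erasure codes such as tornado codes generalize this principle so that several losses can be repaired from a modest amount of redundancy.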
In the ideal case, when k = m, every surrogate server $N_i$ holds all m sub files of the original file F and reorganizes them to form the original file F in the local node. When the user requests file F from the origin server, the request is redirected to one surrogate server in the list $\{N_1, N_2, \ldots, N_m\}$, from which the whole file F is downloaded.

RESULTS AND DISCUSSION
Analytical study: Let $Time_i$ denote the transfer time of file F from the origin server $N_0$ to surrogate server $N_i$, as measured at $N_i$. Average replication time is considered as the performance measure. In an idealistic setting, all the nodes and links are homogeneous and each node can support 'n' network connections to other nodes at B bytes/sec.
Performance of content distribution algorithms in an 'n' server semantic overlay network: The times taken by the different content distribution algorithms to distribute the content over the Semantic Overlay Network are presented in Table 1, from which the replication time proportion of the different content distribution algorithms follows.
Performance of content distribution algorithms in QoS aware dominating set based semantic overlay network: The QoS aware dominating set D is a set of 'r' dominating surrogate servers in the surrogate server set V, and V\D is the set of all the adjacent vertices of the dominating node set D, such that the dominating set satisfies QoS criteria such as exact-1 domination, efficiency, controlled redundance and fault tolerance. Each vertex 'u' in the dominating set D has to dominate every adjacent vertex at least once, so each vertex in D has more or less the same number of neighbor nodes among the adjacent servers set V\D. Contents are therefore replicated only in the dominating set of surrogate servers D instead of V. If the cardinality of D is 'r' or less, then the contents will be replicated in at most 'r' surrogate servers, which is always less than 'n', i.e., |D| ≤ |V|. Therefore, the replication time proportion of the different content distribution algorithms, namely sequential unicast, multiple unicast, Fast Replica (FR), Resilient Fast Replica (R-FR), Optimal Fast Replica (O-FR), Tornado Codes and EFRRA, in QADSON can be expressed as follows:
r : 1 : 2/r : (2 + m/r)*1/r : ((k + r) / r*r*k) : 2*c/r : (2 + c)/r, where r < n (6)
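To see how the terms of Eq. 6 compare, they can be evaluated for sample values. The sketch below is illustrative only: the values chosen for r, m and c are arbitrary, c is used exactly as it appears in Eq. 6, and the O-FR term is left out because its grouping of factors is ambiguous in the expression as printed.

```python
from typing import Dict


def replication_time_ratio(r: int, m: float, c: float) -> Dict[str, float]:
    """Evaluate the terms of Eq. 6, normalized so multiple unicast equals 1."""
    if r <= 0:
        raise ValueError("r must be a positive number of dominating servers")
    return {
        "sequential unicast": float(r),
        "multiple unicast": 1.0,
        "FR": 2 / r,
        "R-FR": (2 + m / r) / r,
        "Tornado Codes": 2 * c / r,
        "EFRRA": (2 + c) / r,
    }


# Arbitrary illustrative inputs, not values taken from the experiments.
ratio = replication_time_ratio(r=10, m=5, c=1.5)
for name, value in ratio.items():
    print(f"{name:>18}: {value:.3f}")
```

Every term except multiple unicast depends on r, so the size of the dominating set directly shapes the relative cost of each scheme.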

Simulation test bed and performance measurement:
Figure 3 shows the average replication time measured by the individual surrogate servers for file sizes of 100 KB, 750 KB, 1.5 MB, 3 MB, 4.5 MB, 6 MB, 7.5 MB, 9 MB, 36 MB, 54 MB, 72 MB and 128 MB in the QADSON based replication set of surrogate servers. High variability of average replication time is observed under Multiple Unicast and Sequential Unicast for larger file sizes. The average content replication time of the EFRRA algorithm across large file sizes in the QADSON based replication set is much more stable and predictable. Hence, the EFRRA algorithm outperforms the traditional content distribution schemes.

Performance of EFRRA content replication algorithm in QADSON in terms of average replication time:
We measured the average replication time of EFRRA when replicating files of different sizes in the SON, DSON, EDSON and QADSON, and the performance graph is depicted in Fig. 7. It is observed that the average replication time of EFRRA is lowest in the QADSON based replication set.

Delivery ratio Vs surrogate server failure fraction:
The delivery ratio is defined as the ratio of the number of data packets successfully received by the recipient surrogate server to the number of data packets sent by the source surrogate server. The worst case delivery ratio of EFRRA in SON, DSON, EDSON and QADSON has been analyzed as the number of simultaneous surrogate server failures in the CDN increases, and its performance is shown in Fig. 5.
From the delivery ratio analysis shown in Fig. 5, we found that the delivery ratio of the EFRRA algorithm in the QADSON based CDN always remains above 0.97, even when the surrogate server failure fraction reaches 0.5. Hence, the QADSON based CDN is found to be fault tolerant and efficient during surrogate server failure.
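The delivery ratio defined above is a simple quotient. A minimal sketch follows, with a hypothetical function name and made-up packet counts chosen only to reproduce the 0.97 threshold mentioned in the text.

```python
def delivery_ratio(packets_received: int, packets_sent: int) -> float:
    """Packets received by the recipient server divided by packets sent by the source."""
    if packets_sent <= 0:
        raise ValueError("packets_sent must be positive")
    if not 0 <= packets_received <= packets_sent:
        raise ValueError("packets_received must lie between 0 and packets_sent")
    return packets_received / packets_sent


# Hypothetical counts: 970 of 1000 packets arrive despite server failures.
example_ratio = delivery_ratio(970, 1000)
```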
Analyze the impact of QoS aware dominating set based SON in CDN formation: By implementing the QoS aware dominating set for the clustering of surrogate servers in the SON, the average number of surrogate servers for content replication is reduced to 55 percent or less. This is depicted in Fig. 6.
Figure 6: Reduction in replication set due to the impact of Equitable Dominating Set in CDN formation
Analyze the role of QoS aware dominating set in surrogate server utilization: CDN utility is the mean of the individual net utilities of each surrogate server in a CDN. Net utility is a value that expresses the relation between the number of bytes of content served and the number of bytes of content pulled from the origin or other surrogate servers. The net utility ($U_i$) of a surrogate server is expressed in terms of α, the ratio of uploaded bytes to downloaded bytes.
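As an illustration of how a mean CDN utility might be computed from per-server byte counts, the sketch below maps α into the range [0, 1] with an arctangent normalization, (2/π)·arctan(α). This particular normalization is an assumption borrowed from the CDN utility literature and is not confirmed by the text above; the function names and byte counts are likewise hypothetical.

```python
import math
from typing import Iterable, Tuple


def net_utility(uploaded_bytes: float, downloaded_bytes: float) -> float:
    """Assumed normalization: (2/pi) * arctan(alpha), alpha = uploaded / downloaded.

    atan2 is used so that a server that pulled nothing (downloaded_bytes == 0)
    but served content gets utility 1 instead of a division error.
    """
    if uploaded_bytes < 0 or downloaded_bytes < 0:
        raise ValueError("byte counts must be non-negative")
    return (2 / math.pi) * math.atan2(uploaded_bytes, downloaded_bytes)


def cdn_utility(servers: Iterable[Tuple[float, float]]) -> float:
    """Mean of the net utilities of all surrogate servers."""
    utilities = [net_utility(up, down) for up, down in servers]
    if not utilities:
        raise ValueError("at least one surrogate server is required")
    return sum(utilities) / len(utilities)


# Hypothetical (uploaded, downloaded) byte counts for three surrogate servers.
mean_utility = cdn_utility([(100.0, 100.0), (100.0, 0.0), (300.0, 100.0)])
```

Under this assumed normalization, a server that uploads exactly as much as it downloads scores 0.5, and the score approaches 1 as served bytes dominate pulled bytes.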

Mean response time Vs number of clients:
A simulation experiment is conducted by fixing the input values for the following parameters:
• Maximum number of requests = 1,000,000
• Number of file objects = 50,000
• Maximum website size = 1 GB
The results are depicted in Fig. 9.
Mean response time Vs number of requests: Another finding is that, when the number of clients in the network is fixed and the number of requests increases, the mean response time of the QADSON based content distribution network is always less than that of the EDSON based CDN, DSON based CDN and SON based CDN, as depicted in Fig. 10.
Latency Vs file size: Latency is defined as the interval between the time the user requests certain content and the time at which it appears in the user's browser or is available at the client machine. The end user perceived latency is a useful metric for selecting a suitable surrogate for that user. In our CDN system, each CDN node determines its set of neighbours using latency information. However, files of different sizes have different latencies and web objects can essentially be of any size. Hence, we need techniques to estimate the latency of downloading an object as a function of file size using only a limited number of probes. Our measurements from simulation experiments showed that the average network latency of downloading a file is roughly proportional to its size when the file size is between 100 KB and 128 MB, as depicted in Fig. 11. From the measured values of latency, it is found that latency in the QADSON based CDN is lower than in the EDSON based CDN, DSON based CDN and SON based CDN for file sizes ranging from 100 KB to 128 MB.
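Since latency is reported to be roughly proportional to file size, one simple way to estimate it from a few probes is an ordinary least-squares line. The sketch below is one possible approach rather than the method used in the experiments; the function names and the probe values are invented for illustration.

```python
from typing import List, Tuple


def fit_latency(probes: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit latency = slope * size + intercept from (size, latency) probes."""
    if len(probes) < 2:
        raise ValueError("at least two probes are needed")
    n = len(probes)
    mean_size = sum(size for size, _ in probes) / n
    mean_latency = sum(latency for _, latency in probes) / n
    sxx = sum((size - mean_size) ** 2 for size, _ in probes)
    if sxx == 0:
        raise ValueError("probes must use at least two distinct file sizes")
    sxy = sum((size - mean_size) * (latency - mean_latency) for size, latency in probes)
    slope = sxy / sxx
    return slope, mean_latency - slope * mean_size


def estimate_latency(size_kb: float, slope: float, intercept: float) -> float:
    """Predicted download latency for a file of the given size."""
    return slope * size_kb + intercept


# Invented probes: (file size in KB, measured latency in seconds).
probes = [(100.0, 0.25), (1000.0, 2.05), (10000.0, 20.05)]
slope, intercept = fit_latency(probes)
predicted = estimate_latency(5000.0, slope, intercept)
```

With three probes lying on the line latency = 0.002 × size + 0.05, the fit recovers that slope and intercept, so the estimate for a 5000 KB file is about 10.05 seconds.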

Number of requests Vs hit ratio percentage:
Generally, surrogate servers serve content to the clients from their caches. Hit ratio percentage is the ratio between the number of content requests a surrogate serves and the number of content requests it receives. A high hit ratio indicates an effective cache management policy, content distribution policy and surrogate server cooperation. It improves network performance and saves bandwidth. A simulation experiment is conducted by fixing the input values for the following parameters:
• Number of clients = 100,000
• Number of file objects = 50,000
From the experimental results plotted in Fig. 12, we can see that, for a particular number of requests, the hit ratio percentage of the QADSON based CDN is always higher than the hit ratio percentages of the EDSON based CDN, DSON based CDN and SON based CDN. Also, in the QADSON based CDN infrastructure the surrogates are able to serve the requests most of the time, as the load is almost equally balanced among the surrogates, so the redirection probability is low. In the EDSON, DSON and SON based CDNs, however, the request redirection probability is higher, and in the worst case the requested content may not be present in a surrogate. The surrogate then redirects the requests to other surrogates that have the content or sometimes to the origin server itself.
Rejection rate Vs number of requests: Rejection rate is defined as the percentage of requests dropped due to service unavailability. It depends on the number of disruptions due to service unavailability in the network. A low rejection rate indicates that users experience high service availability. From Fig. 13, it is observed that the QADSON based CDN has a rejection rate of less than 1.08% owing to the fault tolerant property of QADSON, which is very low compared to the rejection rates observed in the EDSON based CDN, DSON based CDN and SON based CDN.
CDN load vs total number of requests: CDN load can be defined as the ratio of the mean request arrival rate (i.e., the number of requests arriving per second) to the mean service rate. From the experimental results, we found that when the number of requests in the CDN increases, the CDN load always increases. However, the increase in load in the QADSON based CDN is uniform and stays between 0.6 and 0.7. It is also observed that the CDN load of the QADSON based CDN is always less than the CDN load of the EDSON based CDN, DSON based CDN and SON based CDN, as depicted in Fig. 14.
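The three metrics defined in this part of the study (hit ratio percentage, rejection rate and CDN load) are plain ratios and can be written down directly. The sketch below uses hypothetical function names and invented counts chosen to echo the magnitudes quoted in the text.

```python
def hit_ratio_percentage(requests_served: int, requests_received: int) -> float:
    """Share of received content requests that the surrogate served itself, in percent."""
    if requests_received <= 0:
        raise ValueError("requests_received must be positive")
    return 100.0 * requests_served / requests_received


def rejection_rate(requests_dropped: int, total_requests: int) -> float:
    """Percentage of requests dropped due to service unavailability."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * requests_dropped / total_requests


def cdn_load(mean_arrival_rate: float, mean_service_rate: float) -> float:
    """Mean request arrival rate (requests/s) divided by mean service rate (requests/s)."""
    if mean_service_rate <= 0:
        raise ValueError("mean_service_rate must be positive")
    return mean_arrival_rate / mean_service_rate


# Invented sample values for illustration only.
hits = hit_ratio_percentage(9_200, 10_000)
rejected = rejection_rate(108, 10_000)
load = cdn_load(650.0, 1_000.0)
```

A load below 1 means requests arrive more slowly than they can be served, which is consistent with the low rejection rates reported for the QADSON based CDN.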

CONCLUSION
In this study, first we constructed QoS Aware Dominating set based Semantic Overlay Network (QADSON) of surrogate servers to form the logical infrastructure of CDN. Then we applied EFRRA content replication algorithm to disseminate the content among the surrogate servers in QADSON.
We conducted simulation experiments using CDNsim and analyzed the performance of the EFRRA algorithm in terms of average replication time and delivery ratio in SON, DSON, EDSON and QADSON based CDNs.
We investigated the effect of the QoS aware dominating set on SON formation and how it helps in reducing redundancy, improving efficiency and maintaining fault tolerance. We also observed that the QoS aware dominating set based SON keeps the mean response time stable and much more predictable, and that the mean CDN utility is uniform and above 0.95. We also evaluated the performance of the QADSON based CDN in terms of size of CDN, latency, hit ratio percentage, rejection rate and CDN load. Our future work includes the design of virtual organization based peering of cooperative and coordinated CDNs and the evaluation of its performance.

ACKNOWLEDGMENT
The researchers acknowledge TIFAC-CORE in Network Engineering (established under the Mission REACH Program of Department of Science and Technology, Govt. of India) for providing the necessary facilities in the Open Source Technologies Laboratory for working on this project. The authors would also like to thank Kalasalingam Anandam Ammal Charities for generously providing a professional development allowance to support this study.