Content-Sensitive Data Retrieval in Sensor Networks

Problem statement: For a sensor network comprising autonomous and self-organizing data sources, efficient similarity-based search for semantic-rich resources (such as video data) is considered a challenging task because of the lack of infrastructure and multiple resource limitations (such as bandwidth, storage and energy). While past research has discussed routing protocols for sensor networks at length, few works have addressed effective data retrieval with respect to optimized search cost and fairness across various environment setups. This study presents the design of progressive content prediction approaches to facilitate efficient similarity-based search in sensor networks. Approach: The study proposes fully dynamic, hierarchy-free and non-flooding approaches. Association rules and Bayesian probabilities are generated to indicate the content distribution in the sensor network. The proposed algorithms generate the interest node set for a node based on its query history, the association rules and the Bayesian rule. Because in most cases the data content of a node is semantically related to the queries it issues, the sensor network is partitioned into small groups of common-interest nodes and most queries can be resolved within these groups. Consequently, blind search based on flooding can be replaced by heuristic-based unicasting or multicasting schemes, which drastically reduces the system cost in storage space, network bandwidth and computation power. Results: We verified the performance with experimental analysis. The simulation results show that both the Bayesian scheme and the association scheme require far fewer messages than flooding, which drastically reduces the consumption of system resources. Conclusion: Content distribution knowledge can be used to improve the system performance of content-based data retrieval in sensor networks.


INTRODUCTION
Wireless sensor networks are shared-medium multi-hop systems consisting of radio-equipped sensor nodes. These networks are useful in any situation where temporary network connectivity and communication are needed. The sensor nodes are capable not only of storing and processing data, but also of performing complex operations through their communications, such as similarity-based retrieval of data objects. Data management is an essential application for sensor networks, as witnessed by the wealth of research literature [1,2,11]. However, efficient retrieval of semantic-rich data in sensor networks is challenging because of multiple constraints such as node mobility, computation capability, memory space and bandwidth.
To address the problem of data retrieval in sensor networks, a variety of approaches have been presented in the literature. Researchers have proposed methods based on a centralized client-server architecture; examples are presented in [3,4]. One common characteristic of these models is the reliance on centralized storage (or a head node) that handles the queries from clients and returns the results. This assumption violates the requirements of sensor networks, where sensor nodes should be considered equal peers and none of them should be given extra capability or responsibility. Moreover, centralized models introduce single points of failure and therefore reduce robustness and scalability.
Decentralized approaches overcome the reliance on a central directory server and employ unstructured peer-to-peer connections to resolve the queries. However, these approaches do not take link cost into consideration when computing routes [5,12]. This shortcoming causes a drastic waste of system resources and makes the decentralized approaches unsuitable for practical applications of sensor networks. Some protocols have been proposed to reduce the search cost in mobile environments, but these protocols still rely heavily on the flooding of query messages and may be too expensive to operate.
As an alternative, a great deal of research has been done on organizing sensor networks based on content distribution. Search algorithms based on landmark hierarchies were proposed in [6]. The hierarchy consists of data nodes and landmark nodes that behave as indices. Subsequent improvements, such as the semantic-aware hierarchy [2], better organize node contents and distance information by clustering content-similar nodes and mapping them into semantic categories in the semantic space. However, topology or content changes incur the cost of updating such indices. Further improvements to search efficiency have led to overlaid infrastructures (e.g., Chord [7]) that use content/location mapping to direct the searches to nodes holding the requested data. This approach causes higher maintenance overhead, because the mapping relationships must be updated in accordance with content changes.
Another solution is the hash-table-derived approach. Content Addressable Networks (CANs) provide a distributed hash table abstraction over the Cartesian space of the data objects. They employ (object, key) pairs to allow efficient storage exploitation and query processing [9]. The implementations include logical overlay networks (e.g., pSearch [5]) that use dimension-reduction techniques to reduce search cost. The content-navigated retrieval models rely on proactive hash tables to guarantee their efficiency in the retrieval process, which also poses significant challenges for data location restriction and hash table maintenance in practical applications.
Motivated by the strengths and weaknesses of centralized and decentralized approaches, we propose to utilize the cooperation among sensor nodes to improve the system performance. In this study, we present two progressive content prediction schemes, Association-based Content Prediction (ACP) and Bayesian-based Content Prediction (BCP), to facilitate content-based retrieval. The fundamental idea is to predict the location of data sources based on query history and probability rules. With this content distribution knowledge, data retrieval is performed with reduced network traffic and response time because heuristic-based unicasting is employed instead of blind broadcasting.

MATERIALS AND METHODS
Proposed schemes: Our schemes are motivated by three observations. First, a sensor node may have specific interests and query some data objects frequently. Second, the data contents of a node are often semantically similar to the queries it has issued. Therefore, by analyzing the earlier queries delivered among the nodes, the system may forward later queries to a small collection of nodes that requested or resolved earlier queries with similar semantics. Third, some nodes in the network may share similar interests and generate the same queries. These nodes can be grouped into common-interest clusters, in which any query can be cached and analyzed to facilitate later queries issued by nodes in the cluster. Based on these three observations, we propose to exploit the query history of sensor nodes to find the relationships among multimedia data objects and to use this knowledge to determine the data contents of sensor nodes and ultimately improve the performance of content-based multimedia information retrieval.

Problem formulation:
We introduce the formalized problem description of content-based retrieval in sensor networks. We consider a sensor network consisting of resource-constrained nodes moving within a specified geographic region. The hardware capacity of the nodes may vary and each node is equipped with a cache space for recording the query history and for data prefetching. We assume reliable pair-wise message communication between sensor nodes, with messages delivered in the order of their generation, using an existing routing protocol such as AODV or DSR.
We consider the type of application in which all nodes in the sensor network share a standard representation of data objects. More specifically, the sensor nodes have an agreed-upon set of features that indicate the semantic contents of data objects.
Based on the aforementioned assumptions, we consider the sensor network as an undirected connected graph $G = (N, C, W)$, where $N = \{n_1, n_2, \ldots, n_r\}$ is the set of sensor nodes, $C = \{c_1, c_2, \ldots, c_s\}$ is the set of wireless connections between neighboring sensor nodes and $W$ is the pair-wise weight (or distance) between the nodes. The problem of content-based retrieval in the sensor network can be viewed as follows: Given a set of data objects $X = \{x_1, x_2, \ldots, x_m\}$ disseminated among the set of sensor nodes $N$, a query data object $Q$ and an integer $k$, find the minimum subset $N^* = \{n_1^*, n_2^*, \ldots, n_s^*\} \subset N$ containing the top $k$ data objects with the smallest semantic distances to $Q$, where the similarity between data objects is evaluated using the Euclidean distance in the semantic space.
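As a point of reference, the retrieval problem above can be sketched as a brute-force baseline that has global knowledge of every node's contents. The function name `top_k_nodes`, the dictionary layout and the assumption that each object is stored on exactly one node are illustrative choices, not part of the proposed schemes; the sketch only makes the inputs and outputs of the formulation concrete.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two semantic vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_nodes(placement: Dict[str, Dict[str, Vector]],
                query: Vector, k: int) -> Tuple[List[str], Set[str]]:
    """Return the k objects closest to `query` and the nodes holding them.

    `placement` maps node id -> {object id -> semantic vector}.
    Assumption: every object id appears on exactly one node.
    """
    scored = [(euclidean(vec, query), obj_id, node_id)
              for node_id, objects in placement.items()
              for obj_id, vec in objects.items()]
    scored.sort()                      # smallest distance first
    best = scored[:k]
    return [obj for _, obj, _ in best], {node for _, _, node in best}


# Hypothetical toy network with three nodes and four objects.
placement = {
    "n1": {"x1": (0.0, 0.0), "x2": (1.0, 1.0)},
    "n2": {"x3": (0.1, 0.0)},
    "n3": {"x4": (5.0, 5.0)},
}
objects, nodes = top_k_nodes(placement, query=(0.0, 0.1), k=2)
```

A real sensor node has no such global view, which is why the schemes in this study predict the likely holders of the top $k$ objects instead of enumerating all of them.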

Definition 1:
The Euclidean vector norm: Given the set of data objects $X = \{x_1, x_2, \ldots, x_m\}$, each object $x_i$ is represented as an n-dimensional semantic vector $\varphi_{x_i} = (\omega_1^{x_i}, \omega_2^{x_i}, \ldots, \omega_n^{x_i})$. The Euclidean vector norm of the semantic vector $\varphi_{x_i}$, also called the 2-norm, indicates the length of $\varphi_{x_i}$ in Euclidean geometry:

$$\|\varphi_{x_i}\|_2 = \sqrt{\sum_{l=1}^{n} \left(\omega_l^{x_i}\right)^2} \qquad (1)$$

Definition 2:
The scalar product: The scalar product of two semantic vectors $\varphi_{x_i}$ and $\varphi_{x_j}$, denoted as $\varphi_{x_i} \cdot \varphi_{x_j}$, is defined as:

$$\varphi_{x_i} \cdot \varphi_{x_j} = \sum_{l=1}^{n} \omega_l^{x_i}\, \omega_l^{x_j} \qquad (2)$$

Definition 3:
The semantic similarity: The semantic similarity between data objects is defined based on the Euclidean distance between their semantic vectors. Formally, the distance between two data objects $x_i$ and $x_j$ is defined as a cosine distance function of their semantic vectors.

Association-based content prediction:
Preliminary concepts: The distribution pattern of the data objects over the sensor nodes can be considered as a many-to-many relationship: each data object may be distributed among multiple sensor nodes and each sensor node may contain a collection of data objects. Let $\chi(n_j) = \{x_{j1}, x_{j2}, \ldots, x_{jv}\}$ denote the queried data object set of sensor node $n_j$. Clearly, $\chi(n_j)$ is a subset of $X$, i.e., $\chi(n_j) \subseteq X$. Suppose there exists a list of earlier queries $L = \{q_1, q_2, \ldots, q_w\}$, where each query $q_i$ is a data object in $\chi(n_j)$. An association rule is a repeatedly occurring pattern $R: Q \rightarrow x_{jt}$, where $Q = \{q_{t1}, q_{t2}, \ldots, q_{ts}\}$ is a set of queries and $x_{jt} \in \chi(n_j) - Q$ is a data object. Here $Q$ and $x_{jt}$ are called the antecedent and consequence of the association rule, respectively.
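Definitions 1 to 3 can be made concrete with a small sketch. The helper names are hypothetical, and taking the cosine distance as one minus the cosine similarity is an assumption (one common convention), not necessarily the exact function used in this study.

```python
import math
from typing import Sequence


def norm2(v: Sequence[float]) -> float:
    """Euclidean (2-)norm of a semantic vector (Definition 1)."""
    return math.sqrt(sum(w * w for w in v))


def scalar_product(u: Sequence[float], v: Sequence[float]) -> float:
    """Scalar product of two semantic vectors (Definition 2)."""
    if len(u) != len(v):
        raise ValueError("semantic vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed form of the cosine distance: 1 - cos(angle between u and v)."""
    denom = norm2(u) * norm2(v)
    if denom == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - scalar_product(u, v) / denom


same_direction = cosine_distance((1.0, 0.0), (2.0, 0.0))   # parallel vectors
orthogonal = cosine_distance((1.0, 0.0), (0.0, 1.0))       # perpendicular vectors
```

Under this convention, vectors pointing in the same direction have distance 0 regardless of their length, and orthogonal vectors have distance 1.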

Definition 4:
The dataset support: The support of a data query set $Q$, denoted as $\psi(Q)$, is defined as the percentage of query sets in the list $L$ that include $Q$ as a subset. The dataset support can be formalized as:

$$\psi(Q) = \frac{|\mathrm{truesubset}(L, Q)|}{|\mathrm{subset}(L)|} \qquad (4)$$

where subset(L) returns all possible subsets of L, while truesubset(L, Q) returns only the subsets of L that are also supersets of Q.

Definition 5:
The association support: The support of an association rule $R: Q \rightarrow x_{jt}$ is the percentage of query sets that include both $Q$ and $x_{jt}$, that is, the dataset support $\psi(Q \cup \{x_{jt}\})$.

Definition 6:
The association confidence: The confidence of an association rule $R: Q \rightarrow x_{jt}$, denoted as $\zeta(R)$, is the ratio of the association support to the dataset support:

$$\zeta(R) = \frac{\psi(Q \cup \{x_{jt}\})}{\psi(Q)} \qquad (6)$$

Definition 7:
The association rule set: Given a query history list $L$ and a confidence threshold $\zeta_{th}$, the association rule set $\mathrm{Rule}(L, \zeta_{th})$ is the set of association rules having confidence no smaller than $\zeta_{th}$:

$$\mathrm{Rule}(L, \zeta_{th}) = \{R \mid \zeta(R) \geq \zeta_{th}\} \qquad (7)$$

Definition 8:
The validated node content: Given a sensor node $n_i$ and a query history $L$, the data objects in the query results that are resolved from node $n_i$ are considered as the validated node content of $n_i$, denoted as $\upsilon(n_i)$.
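Definitions 4 to 7 can be illustrated as follows. The sketch assumes the query history is stored as a list of query sets and measures support as the fraction of those sets that contain the given items, which is the usual association-rule reading of the definitions. The function names, the `max_antecedent` limit and the toy history are hypothetical, and the exhaustive enumeration is a stand-in for illustration, not the rule generation procedure of this study.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Set, Tuple

Rule = Tuple[FrozenSet[str], str]   # (antecedent, consequence)


def support(history: List[Set[str]], items: Iterable[str]) -> float:
    """Fraction of query sets in the history that contain all `items`."""
    wanted = set(items)
    if not history:
        return 0.0
    return sum(1 for query_set in history if wanted <= query_set) / len(history)


def confidence(history: List[Set[str]], antecedent: Iterable[str],
               consequence: str) -> float:
    """Support of antecedent plus consequence, divided by antecedent support."""
    base = support(history, antecedent)
    if base == 0:
        return 0.0
    return support(history, set(antecedent) | {consequence}) / base


def rule_set(history: List[Set[str]], threshold: float,
             max_antecedent: int = 2) -> List[Rule]:
    """All rules with confidence >= threshold, by exhaustive enumeration."""
    universe = sorted(set().union(*history)) if history else []
    rules: List[Rule] = []
    for size in range(1, max_antecedent + 1):
        for antecedent in combinations(universe, size):
            if support(history, antecedent) == 0:
                continue                      # antecedent never observed
            for item in universe:
                if item in antecedent:
                    continue
                if confidence(history, antecedent, item) >= threshold:
                    rules.append((frozenset(antecedent), item))
    return rules


history = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
rules = rule_set(history, threshold=0.6)
```

In this toy history every single-item rule has confidence 2/3 and passes the 0.6 threshold, while every two-item antecedent yields confidence 0.5 and is rejected.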

Definition 9:
The estimated node content: Given a sensor node $n_i$ and a rule set $\mathrm{Rule}(L, \zeta_{th})$, the estimated content of $n_i$ is the set of data objects that are the consequences of the association rules after applying the validated node content of $n_i$ as the antecedents:

$$\xi(n_i) = \{x_j \mid R: Q \rightarrow x_j,\ \forall Q \in \upsilon(n_i) \wedge R \in \mathrm{Rule}(L, \zeta_{th})\} \qquad (9)$$

Definition 10:
The estimated node interest: Given a sensor node $n_i$ and a rule set $\mathrm{Rule}(L, \zeta_{th})$, the estimated interest of $n_i$ is the set of data objects that are the consequences of the association rules after applying the node interest of $n_i$ as the antecedents:

$$\lambda(n_i) = \{x_j \mid R: Q \rightarrow x_j,\ \forall Q \in \chi(n_i) \wedge R \in \mathrm{Rule}(L, \zeta_{th})\} \qquad (10)$$
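Equations (9) and (10) apply the same operation to two different inputs: the consequences are collected for every rule whose antecedent is covered by the given object set. A minimal sketch, assuming rules are stored as (antecedent, consequence) pairs and reading the antecedent condition as "every object of the antecedent belongs to the set" (an interpretation, not stated explicitly in the text); the function name and sample data are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Rule = Tuple[FrozenSet[str], str]   # (antecedent, consequence)


def apply_rules(known_objects: Iterable[str], rules: List[Rule]) -> Set[str]:
    """Consequences of all rules whose antecedent is contained in `known_objects`."""
    known = set(known_objects)
    return {consequence for antecedent, consequence in rules
            if antecedent <= known}


rules = [
    (frozenset({"a"}), "b"),
    (frozenset({"a", "c"}), "d"),
    (frozenset({"e"}), "f"),
]

# Estimated content: rules applied to the validated content of a node.
estimated_content = apply_rules({"a", "c"}, rules)
# Estimated interest: rules applied to the objects the node has queried.
estimated_interest = apply_rules({"e"}, rules)
```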

Definition 11:
The interest-content overlap: Given a sensor node $n_i$ and the query history $L$, the set of nodes whose data contents are related to the interest of node $n_i$ is considered as the overlap-interest node set.

Prediction algorithm: The fundamental idea is to first let each sensor node record the queries and results passing through it as its query history and then gradually enlarge the association rule set based on the appearance ratio of data objects in the queries (i.e., antecedents) and results (i.e., consequences). Table 1 shows the notations used in the association rule set generation algorithm. Algorithm 1 shows the details of the rule generation process.

Algorithm 1: Constructing association rule set:
Input: query history L and confidence threshold ζ_th
Output: association rule set Rule(L, ζ_th)

Table 2 shows the notations used in interest node generation. Algorithm 2 for computing the interest node set is listed as follows.

Algorithm 2: Generating the interest node set:
Input: sensor node n_i, association rule set
Output: interest node set of n_i
inst(n_i) ← ∅
while (Hq(n_i) ≠ ∅)
    Q ← £(n_i)
    Hq(n_i) ← Hq(n_i) − {Q}
    for each node n_k in node(Q)
        if (est(n_k) ∩ ξist(n_i) ≠ ∅)
            inst(n_i) ← inst(n_i) ∪ {n_k}
return inst(n_i)

Table 1: Notations for association rule set generation
Notation    Description
…           The number of nodes in the sensor network
qry(L)      The function of selecting a query from L
rslt(Q)     The function of obtaining the result of query Q
sup(X)      The function of computing the support of data set X

The proposed algorithms generate the interest node set for a node based on its query history and the association rules. To simplify the computation, the algorithm employs the estimated interest of the input node to filter out other nodes that do not contain overlapping data contents. Therefore, a node n_i shares its interest with only a small collection of nodes. When a new query is issued, n_i first forwards it to the interest-related node set, with the purpose of resolving the query within a small portion of the network. If the query cannot be resolved in this interest node set, flooding is used to find the result. Due to the semantic locality of queries, most queries will be processed in the small set of nodes.
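The interest node set construction and the two-stage lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionaries standing in for node(Q), est(n_k) and the node stores are hypothetical, and the fallback loop over the remaining nodes is only a stand-in for flooding, not a network-level implementation.

```python
from typing import Callable, Dict, Iterable, Optional, Set, Tuple


def interest_node_set(query_history: Iterable[str],
                      resolvers: Dict[str, Set[str]],
                      estimated_content: Dict[str, Set[str]],
                      estimated_interest: Set[str]) -> Set[str]:
    """Nodes that resolved an earlier query and whose estimated content
    overlaps the estimated interest of the querying node."""
    interest_nodes: Set[str] = set()
    for query in query_history:                      # iterate over Hq(n_i)
        for node in resolvers.get(query, set()):     # node(Q)
            if estimated_content.get(node, set()) & estimated_interest:
                interest_nodes.add(node)
    return interest_nodes


def resolve(query: str, interest_nodes: Set[str], all_nodes: Iterable[str],
            holds: Callable[[str, str], bool]) -> Tuple[Optional[str], int]:
    """Ask interest nodes first, then everyone else; return (node, contacts)."""
    contacted = 0
    remaining = sorted(set(all_nodes) - interest_nodes)
    for node in sorted(interest_nodes) + remaining:
        contacted += 1
        if holds(node, query):
            return node, contacted
    return None, contacted


# Hypothetical toy data.
store = {"n1": set(), "n2": {"x1", "x2"}, "n3": {"x9"},
         "n4": {"x2"}, "n5": {"x7"}}
resolvers = {"q1": {"n2", "n3"}, "q2": {"n4"}}
estimated_content = {"n2": {"x1", "x2"}, "n3": {"x9"}, "n4": {"x2"}}

interest = interest_node_set(["q1", "q2"], resolvers, estimated_content,
                             estimated_interest={"x2", "x5"})


def holds(node: str, obj: str) -> bool:
    return obj in store[node]


hit = resolve("x2", interest, store.keys(), holds)    # found inside the group
miss = resolve("x7", interest, store.keys(), holds)   # needs the fallback
```

The contact count returned by `resolve` gives a rough feel for the saving: a query that matches the node's interest is answered after one contact, whereas a query outside it still reaches every node in the toy network.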
In the aforementioned algorithms we address finding the content distribution knowledge of the sensor network to facilitate similarity-based data retrieval. In comparison with the conventional centralized or flooding methodologies, the proposed content prediction approach has the following properties:
• The fundamental idea is based on the association rules generated from the analysis of the coincidence pattern of data objects in the earlier queries and their results. Because in most cases the data content of a node is semantically related to the queries it issues, the sensor network is partitioned into small groups of common-interest nodes and most of the queries can be resolved within these groups. Consequently, blind search based on flooding can be replaced by heuristic-based unicasting or multicasting schemes, which drastically reduces the system cost in storage space, network bandwidth and computation power
• The proposed content prediction approach does not rely on any fixed or overlaid infrastructure, which not only adheres to the infrastructure-free nature of sensor networks, but also greatly reduces the system overhead of updating the indexing or mapping information in the content/location servers, improving the robustness and scalability

Bayesian-based content prediction:
Methodology statement: Suppose we are given a collection of data objects $X = \{x_1, x_2, \ldots, x_m\}$ that are disseminated among the sensor nodes. Each data object is assigned a probability of being accessed. After a period of time, some data objects are requested in the query history, which changes the access probabilities.
Assume that for each data object $x_i$ there exists some underlying probability distribution $P(\cdot \mid x_i)$, referred to as the fundamental relevance probability. The initial probability model can be constructed using many existing analysis systems, e.g., Latent Semantic Analysis (LSA) [5,10]. The data objects will then be retrieved with a collection of queries, which can be considered as a series of iterations, each iteration containing the query results and their relevance to the query objects. For a specific iteration $t_k$, when a new query $Q$ arrives, the data retrieval system can use a deterministic strategy to select the objects with the highest Bayesian probabilities $P(x = Q \mid t_1 \ldots t_k)$.

Table 2: Notations for interest node generation
Notation     Description
inst(ni)     The interest node set of ni
Hq(ni)       The history of queries issued by node ni
£(ni)        The function of obtaining a query issued by ni
node(Q)      The function of finding the nodes resolving Q
est(nk)      The function of estimating the content of node nk
ξist(ni)     The estimated interest of node ni

Table 3: Notations for Bayesian content prediction
Notation        Description
P(xj∈ni)        The probability of data object xj existing in ni
P(X|ni, tk)     The mapping of all X's data objects in node ni
query(tk)       The function of taking a query from iteration tk
obj(Q)          The function of taking an object from query Q
satify(Q, ni)   Whether the query Q is satisfied at node ni

Definition 12:
The Bayes' rule: Based on above descriptions of data retrieval, the Bayes' rule can be defined as: where, P (t 0 ) = 1.
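For reference, a standard sequential form of Bayes' rule that matches the iterative retrieval process described above can be written as follows. The conditioning structure shown here is an assumption based on the general rule, not necessarily the exact expression used in this study; with $P(t_0) = 1$, the first iteration starts from the unconditioned prior $P(x = Q)$.

```latex
P(x = Q \mid t_1, \ldots, t_k) =
  \frac{P(t_k \mid x = Q,\, t_1, \ldots, t_{k-1}) \;
        P(x = Q \mid t_1, \ldots, t_{k-1})}
       {P(t_k \mid t_1, \ldots, t_{k-1})}
```

Each iteration therefore reuses the posterior of the previous iteration as its prior, which is what allows the probabilities to be refined progressively as queries are resolved.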

Content prediction:
The Bayesian-based content prediction is a process of estimating the probability that the user's next query will be satisfied by the data content of a sensor node, based on the observed resolution of earlier queries at that node. For a node $n_i$, the query history is divided into a series of iterations $t_1, \ldots, t_k$, each iteration containing a collection of queries and the results (i.e., whether the requested data objects are found at $n_i$). Table 3 shows the notations used for Bayesian content prediction. The detailed process is described in Algorithm 3.

Algorithm 3: Bayesian content prediction:
Input: query iteration t_k, data content mapping P(X|n_i, t_{k-1}) before t_k
Output: data content mapping P(X|n_i, t_k) after t_k
for each query Q ← query(t_k)
    for each object x_j ← obj(Q)
        if (satify(Q, n_i))
            P(X|n_i, t_k) ← P(X|n_i, t_{k-1}) P(x_j ∈ n_i)
        else
            P(X|n_i, t_k) ← P(X|n_i, t_{k-1}) (1 − P(x_j ∈ n_i))
return P(X|n_i, t_k)

With the help of Bayesian-based content prediction, the retrieval of data objects becomes a clearly aimed process, navigated by the probabilistic distribution of data objects among the nodes:
• When a query (i.e., data object x_q) is issued to node n_i, first check whether it can be resolved locally
• If x_q is not locally resolved, compare the probabilities P(x_q ∈ n_1), …, P(x_q ∈ n_r) and find the node with the highest probability
• Update the data content mapping P(X|n_i, x_q) using the content prediction algorithm

Based on this strategy, the queries may be resolved with reduced search cost in terms of network traffic and computation complexity, because the query messages are forwarded only to the nodes with the highest Bayesian probabilities. This approach is also self-stabilizing because it updates the probabilistic information after each query is resolved. This maintenance process may cause system overhead; however, the overhead is amortized over the queries and is thereby alleviated.
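The three-step strategy above can be sketched in a few lines. This is an illustration under stated assumptions rather than a transcription of Algorithm 3: the update uses a textbook two-hypothesis Bayes step, and the likelihood values `p_hit_if_present` and `p_hit_if_absent`, the default prior and all function names are hypothetical.

```python
from typing import Dict, Optional, Set

DEFAULT_PRIOR = 0.5   # assumed belief for an object never observed at a node


def bayes_update(prior: float, hit: bool,
                 p_hit_if_present: float = 0.9,
                 p_hit_if_absent: float = 0.1) -> float:
    """Posterior probability that a node holds an object after one query."""
    like_present = p_hit_if_present if hit else 1.0 - p_hit_if_present
    like_absent = p_hit_if_absent if hit else 1.0 - p_hit_if_absent
    evidence = like_present * prior + like_absent * (1.0 - prior)
    if evidence == 0:
        return prior
    return like_present * prior / evidence


def forward_query(obj: str, local_node: str, store: Dict[str, Set[str]],
                  belief: Dict[str, Dict[str, float]]) -> Optional[str]:
    """Resolve locally if possible, otherwise ask the most probable node
    and update the belief about that node with the outcome."""
    if obj in store[local_node]:
        return local_node
    candidates = {node: probs.get(obj, DEFAULT_PRIOR)
                  for node, probs in belief.items() if node != local_node}
    if not candidates:
        return None
    target = max(candidates, key=candidates.get)
    hit = obj in store[target]
    belief[target][obj] = bayes_update(candidates[target], hit)
    return target if hit else None


# Hypothetical toy network: n3 looks more promising at first but lacks x1.
store = {"n1": set(), "n2": {"x1"}, "n3": set()}
belief = {"n2": {"x1": 0.5}, "n3": {"x1": 0.6}}

first = forward_query("x1", "n1", store, belief)    # asks n3, which misses
second = forward_query("x1", "n1", store, belief)   # now asks n2, which hits
```

The miss lowers the belief about n3 below that of n2, so the second attempt is redirected without any flooding; this is the self-correcting behavior described in the paragraph above.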

Performance analysis:
We compared the performance of the two proposed schemes using a simulator implemented in the ns-2 environment. The simulated sensor network consists of up to 200 sensor nodes joining and leaving the network according to a Poisson process during a period of 8000 s. The data object set comprises up to 12,000 data points whose semantic vector values are assigned by a random number generator following a normal distribution in the interval [0,1). Each node randomly selects waypoints within a 2000 × 1000 m flat area. The node density can be adjusted by changing the number of nodes in the area. The query generation pattern follows the exponential distribution, similar to the previous study [8]. To provide storage for recording earlier queries, we also assigned each node a fixed-size 1 MB cache.
One important metric for a data retrieval system is accuracy. The performance of a search scheme can be evaluated by its accuracy when accessing a restricted portion of nodes. In the simulation we restricted the search schemes to visiting fewer than 10% of the nodes. Figure 1 compares the accuracy of query results returned by the Bayesian, association and flooding schemes as the number of queries increases. Generally speaking, a longer query history improves the accuracy of all three schemes. The flooding scheme improves its accuracy because a larger number of earlier queries are cached, which increases the possibility of local resolution. The association-based scheme performs worse than the Bayesian scheme at the beginning because association rules generated from a small sample of queries are unstable. As the query history becomes longer, the association rules describe the relationships among data objects more precisely and therefore help improve the retrieval accuracy.
In another simulation run, we measured the impact of node density on query response time. Figure 2 shows the result. As can be seen from Fig. 2, the association-based scheme achieves the shortest response time. This is because data objects can be prefetched with the help of association rules, reducing the average search delay. Note that the flooding scheme outperforms the Bayesian scheme on response time because query broadcasting finds the shortest path to the data sources. However, this shorter response time is obtained at the price of high message complexity. In Fig. 3 we compare the average number of messages required to resolve a query. Both the Bayesian scheme and the association scheme require far fewer messages than flooding, which drastically reduces the consumption of system resources.

DISCUSSION
The proposed scheme makes use of data content distribution in sensor networks to resolve video queries without incurring flooding in the network. Through theoretical and experimental study, we found that the proposed scheme has the following features:
• Association and Bayesian probability based content estimation
• Constraint-based representation method showing the semantic similarity between query objects
• Non-flooding query processing
• Adaptive accuracy and search cost performance

CONCLUSION
Content distribution knowledge can be used to improve the system performance of content-based data retrieval in sensor networks. In this study we proposed the Association-based Content Prediction (ACP) scheme and the Bayesian-based Content Prediction (BCP) scheme to facilitate efficient light-weight data retrieval. Association rules and Bayesian probabilities are generated to indicate the content distribution in the sensor network. The simulation results show that these schemes drastically improve system performance in terms of accuracy and search cost.