Application of Syndication to the Management of Bibliographic Catalogs

Problem statement: The transmission of bibliographic records between libraries is a complex task, usually handled by the Z39.50 protocol. Approach: The objective of this research is to propose an alternative method that simplifies this process using content syndication techniques. Results: A computer program was used to compare the feasibility of different formats (ATOM, RSS 1.0, RSS 2.0 and MARC-XML) for conveying and sharing library catalogs of various sizes (up to 1 million records) between libraries. Tests have shown that, for smaller collections of up to 25,000 records, the insertion/import time of the catalogs is less than 1 min. Conclusion/Recommendations: The analysis suggests that syndication is a useful technique for the transmission and retrieval of bibliographic information.


INTRODUCTION
Content syndication refers to a technique for transmitting information in XML (Brickley and Guha, 2004) through channels or sources that can be updated and shared with any client on the web.
Dave Winer (2001) is one of the pioneers of this technique. It was applied initially to the mass media (New York Times, 2009). It has since been extended to various areas, including the academic field, and covers both textual and audio-visual content.
Applications in librarianship consist of channels for literature alerts or for general information (ANU, 2010). Syndication has also been used for the dissemination of articles and scientific journals (Abadal et al., 2006) and for the selective dissemination of information in digital libraries (Peis et al., 2008).
In this study we deal with the use of syndication in the library field, analyzing the possibilities of transmission and retrieval of textual bibliographic catalogs using content syndication.

Methodology and results:
We first had to develop test collections of different sizes from the bibliographic catalog of the LC (2012) through very general topic queries. Their names and dimensions can be seen in Table 1.
Once the collections were in CSV format, a program (developed in PHP) converted them into raw XML format. The collections were split into groups of one thousand records each to speed up the later insertion of the data into MySQL. We call this process data conversion. Table 2 summarizes the conversion times obtained.
The collections were then syndicated in various formats (ATOM, RSS 1.0 and RSS 2.0) to compare the performance of the different syndication formats. A further format was added, MARC-XML in two variants (short and extended), which is not currently considered a syndication format but is widely used in the library world.
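The data conversion step described above (CSV to raw XML, split into groups of one thousand records) might be sketched as follows. This is a minimal Python illustration, not the PHP program used in the study; the element names `records` and `record`, the sample column headers and the helper names are assumptions made for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

def chunked(rows, size=1000):
    """Yield successive lists of at most `size` rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def csv_to_xml_chunks(csv_text, size=1000):
    """Convert CSV text into a list of XML strings, one per group of records.

    Assumes the CSV header names are valid XML element names.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for batch in chunked(reader, size):
        root = ET.Element("records")  # hypothetical root element name
        for row in batch:
            record = ET.SubElement(root, "record")
            for field, value in row.items():
                ET.SubElement(record, field).text = value
        documents.append(ET.tostring(root, encoding="unicode"))
    return documents

# Tiny made-up sample; a group size of 2 is used only to show the splitting.
sample = "title,author,year\nBook A,Smith,1999\nBook B,Jones,2004\nBook C,Lee,2010\n"
chunks = csv_to_xml_chunks(sample, size=2)
```

With the default `size=1000`, each returned document would hold one group of a thousand records, matching the grouping used for the later MySQL insertion.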
A structure was created for each format. To do so, we chose the tags that, being used in practice, were most useful for describing textual (not multimedia) bibliographic records, avoiding possible loss of information. In the case of the MARC format, we assumed a bibliographic corpus consisting mainly of monographs and the use of the Dewey decimal classification, as the source collection came from the Library of Congress. The structure chosen for the ATOM format (The ATOM Syndication Format, 2005) is shown in Fig. 1.
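As an illustration of how a bibliographic record could be expressed as an ATOM entry, a minimal Python sketch follows. The selection of elements (`id`, `title`, `updated`, `author`, `summary`, `content`) is an assumption drawn from the ATOM specification and is not necessarily the structure of Fig. 1; the record values are made up.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)  # serialize ATOM as the default namespace

def atom_entry(record):
    """Build an ATOM <entry> element from a dict describing one record."""
    def tag(name):
        return f"{{{ATOM_NS}}}{name}"

    entry = ET.Element(tag("entry"))
    # id, title and updated are the elements ATOM requires in every entry.
    ET.SubElement(entry, tag("id")).text = record["id"]
    ET.SubElement(entry, tag("title")).text = record["title"]
    ET.SubElement(entry, tag("updated")).text = record["updated"]
    author = ET.SubElement(entry, tag("author"))
    ET.SubElement(author, tag("name")).text = record["author"]
    ET.SubElement(entry, tag("summary")).text = record["summary"]
    # Assumption: the full bibliographic record is carried as plain text.
    content = ET.SubElement(entry, tag("content"), {"type": "text"})
    content.text = record["full_record"]
    return entry

# Hypothetical record used only for illustration.
entry = atom_entry({
    "id": "urn:example:record:1",
    "title": "An example monograph",
    "updated": "2012-01-01T00:00:00Z",
    "author": "Doe, Jane",
    "summary": "Short description of the work.",
    "full_record": "Doe, Jane. An example monograph. 1st ed.",
})
atom_xml = ET.tostring(entry, encoding="unicode")
```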
The tags chosen to set the channel in RSS 1.0 are shown in Fig. 2. From the specifications of the RSS 2.0 format (RSS 2.0 specification, 2003), the structure was configured as shown in Fig. 3.
To set records in MARC-XML format, the specifications of the Library of Congress and the MARC Standards Office (MARC-XML schema, 2009) were used. The structures of the short and extended versions are shown in Fig. 4 and 5 respectively.
From the collections shown in Table 1, a PHP program creates the channels in the different formats with the structures above. The creation times obtained are shown in Table 3.
Once all collections (from 1000 up to 1 million records) had been syndicated in the different formats, the diffusion of the data in a channel from the server to the client computer was simulated by an import program (also developed in PHP), whose main tasks are:
• Identify the format of the channel through its extension
• Create a data table in MySQL with a structure adapted to the syndication format
• Read each bibliographic record sequentially. The time spent in this process is the transfer time between the server and the client computer. In practice this transfer time depends on many factors, including the bandwidth of the network, the processing speed of the client computer and the system memory. Therefore it has not been considered in this study
• Insert each bibliographic record into the data table, in groups of one thousand (this level of grouping was found to minimize the insertion time). This process has been called data insertion. Since transfer time has not been considered in this study, insertion time represents the total import time of the syndicated sources
The results obtained in this process for the different collections and formats are summarized in Table 4.
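The import steps listed above (detect the format from the extension, create a table, read the records sequentially and insert them in groups) can be sketched in Python as follows. This is only an illustration under stated assumptions: SQLite from the standard library stands in for MySQL, the table layout is reduced to two columns, and the extension mapping, table names and function name are hypothetical.

```python
import os
import sqlite3
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical mapping: extension -> (table name, record tag, title tag).
FORMATS = {
    ".atom": ("records_atom", ATOM + "entry", ATOM + "title"),
    ".rss": ("records_rss2", "item", "title"),
}

def import_channel(filename, xml_text, conn, batch_size=1000):
    """Import one syndicated channel; return the number of records inserted."""
    # 1. Identify the format of the channel through its extension.
    extension = os.path.splitext(filename)[1].lower()
    table, record_tag, title_tag = FORMATS[extension]
    # 2. Create a table for the format (simplified to two columns here).
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (title TEXT, record TEXT)")
    # 3. Read each record sequentially; 4. insert in groups of batch_size.
    batch, inserted = [], 0
    for record in ET.fromstring(xml_text).iter(record_tag):
        title = record.findtext(title_tag, default="")
        batch.append((title, ET.tostring(record, encoding="unicode")))
        if len(batch) == batch_size:
            conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", batch)
            inserted += len(batch)
            batch = []
    if batch:  # flush the last, possibly incomplete, group
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", batch)
        inserted += len(batch)
    conn.commit()
    return inserted

# Made-up three-item RSS 2.0 channel; batch_size=2 only to exercise grouping.
sample_rss = (
    '<rss version="2.0"><channel><title>Demo</title>'
    "<item><title>Book A</title></item>"
    "<item><title>Book B</title></item>"
    "<item><title>Book C</title></item>"
    "</channel></rss>"
)
conn = sqlite3.connect(":memory:")
count = import_channel("demo.rss", sample_rss, conn, batch_size=2)
```

Grouping the inserts with `executemany` mirrors the idea of inserting a thousand records at a time; the insertion time measured in the study would correspond to the time spent inside a function of this kind.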

RESULTS AND DISCUSSION
Each format has a different capacity for the syndication of bibliographic catalogs. This is due to its internal structure, which allows or prevents the input of certain types of information from the bibliographic record. From a librarian's point of view, a bibliographic record consists of the following essential information areas: title and statement of responsibility, edition, material or type of resource, publication, physical description, series and notes (IFLAI, 2007). We have added the possibility of including the full bibliographic record, because it would help to retrieve any information or specific secondary access point not considered in the original format. Table 5 summarizes the information areas that each format can include.
Table 5 shows that the most suitable format for the syndication of bibliographic records is the family of MARC formats, as they can include all the essential areas of bibliographic description with the desired level of completeness. Because the MARC formats already provide a complete bibliographic description, they do not allow the full bibliographic record to be added, as it would duplicate the information.
Nevertheless, such a field would make it possible to display the full record and to access any data from a single field if necessary.
Among the syndication formats, RSS1 is the best suited to a textual bibliographic record, mainly because of the possibility of including the Dublin Core and PRISM modules. However, several shortcomings have been detected, such as the impossibility of including the areas of publication, physical description and series. These problems can be overcome, at least partially, by including a full bibliographic record field in which to add such data.
RSS2 and ATOM have a similar level of adaptability. Both have a low capacity to represent a textual bibliographic record, basically limited to the area of title and statement of responsibility. RSS2 has the added disadvantage of not including the full bibliographic record. This problem can be solved, although in an unorthodox way, by adding modules initially designed for the RSS1 format (Dublin Core and PRISM) after declaring the respective namespaces. This solution would produce a hybrid format far removed from the original, which would in effect become a substitute for RSS1.
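The hybrid approach described above, an RSS 2.0 item extended with Dublin Core elements after declaring the namespace, might look like the following Python sketch. The choice of `dc:creator`, `dc:publisher` and `dc:date` to cover the publication area is an assumption for illustration, and the record values and URL are made up.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)  # emit the conventional "dc:" prefix

def rss2_item_with_dc(title, link, creator, publisher, date):
    """Build an RSS 2.0 <item> carrying extra Dublin Core elements."""
    item = ET.Element("item")
    # Native RSS 2.0 elements (no namespace).
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    # Dublin Core elements borrowed from the module designed for RSS 1.0.
    ET.SubElement(item, f"{{{DC_NS}}}creator").text = creator
    ET.SubElement(item, f"{{{DC_NS}}}publisher").text = publisher
    ET.SubElement(item, f"{{{DC_NS}}}date").text = date
    return item

# Hypothetical record used only for illustration.
item = rss2_item_with_dc(
    title="An example monograph",
    link="https://example.org/record/1",
    creator="Doe, Jane",
    publisher="Example Press",
    date="2010",
)
item_xml = ET.tostring(item, encoding="unicode")
```

The serialized item declares `xmlns:dc` and mixes namespaced and plain RSS 2.0 elements, which is what makes the result a hybrid rather than standard RSS 2.0.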
Another relevant aspect is the creation time of the syndicated channels and its relationship with the size of the files in the different formats. To observe this relationship, Fig. 6 first shows the sizes of the syndicated files.
As shown in the graph, differences in the structure of the formats affect the size of the channels. Logically, the more complex and extensive the structure, the bigger the channel. Formats range from the simplest, RSS2, to the most complex, MARC-XML extended.
However, the differences in the size of the channels do not imply large differences in the creation time of such channels, as shown in Fig. 7.
In fact, although the one million bibliographic records syndicated in MARC-XML extended become a 2673 MB channel and the same collection in RSS2 becomes an 854 MB channel (roughly a third of the size), the creation times are 1091 and 1036 sec respectively, a difference of around 1 min. Moreover, as the size of the initial collection decreases, the differences in creation times also decrease, as can be seen in Table 3. In summary, the format, despite the structural differences, does not have a major effect on the creation time of the syndicated channels.
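The comparison above can be made explicit with a short calculation, using only the sizes and creation times quoted for the one-million-record collection (the symbols $S$ and $t$ are introduced here for convenience):

```latex
\frac{S_{\text{MARC-XML ext.}}}{S_{\text{RSS2}}} = \frac{2673\ \text{MB}}{854\ \text{MB}} \approx 3.13,
\qquad
t_{\text{MARC-XML ext.}} - t_{\text{RSS2}} = 1091\ \text{s} - 1036\ \text{s} = 55\ \text{s},
\qquad
\frac{55\ \text{s}}{1036\ \text{s}} \approx 5.3\,\%
```

A channel about three times larger therefore takes only about 5% longer to create.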
The import times (considering only the insertion time into the database) given in Table 4 are summarized graphically in Fig. 8. The figure shows that the difference in insertion times between the formats is high when the initial collections are big, but this difference decreases as the collection becomes smaller.
For the collections of a million records, the maximum difference obtained (between the MARC extended format and the ATOM format) is around 15 min. For the collections of half a million records, the maximum difference is approximately 8 min between the same formats. For the collections of 250,000 records, the maximum difference is around 4 min. This maximum difference is 1.7 min for the collections of 100,000 records, less than a minute (50.53 sec) for 50,000 records and finally 3.3 sec for 1000 records.
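To make the trend in these figures easier to see, the maximum differences quoted above can be normalized per thousand records. The Python sketch below uses only the rounded values given in the text (several of which are stated as approximate), so the results are indicative rather than exact; the variable names are chosen for the example.

```python
# Maximum difference in insertion time between formats, in seconds,
# keyed by collection size (values as quoted in the text, some rounded).
max_diff_seconds = {
    1_000_000: 15 * 60,
    500_000: 8 * 60,
    250_000: 4 * 60,
    100_000: 1.7 * 60,
    50_000: 50.53,
    1_000: 3.3,
}

# Difference per group of one thousand records, rounded to two decimals.
per_thousand = {
    size: round(seconds / (size / 1000), 2)
    for size, seconds in max_diff_seconds.items()
}

for size, value in per_thousand.items():
    print(f"{size:>9} records: {value} sec per 1000 records")
```

For collections of 50,000 records or more, these rounded figures work out to roughly 0.9 to 1.0 sec per thousand records, which is consistent with a maximum difference that grows approximately in proportion to the collection size.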
In summary, although the creation time of the channels hardly depends on the size of the collections, the insertion/import time depends strongly on it. The data obtained let us conclude that the insertion/import of syndicated data takes less than a minute, regardless of the format chosen, only when the collection does not exceed 25,000 records. For collections of 250,000 records or more, the insertion/import times are considerably longer.

CONCLUSION
The analysis suggests that content syndication is a useful technique for the transmission and retrieval of textual bibliographic data and an alternative to the Z39.50 protocol, which is more complex and difficult to use. The smaller the syndicated collection of records, the more useful the technique. When the collection is smaller than 25,000 records, the insertion/import time is less than a minute, regardless of the format chosen. Accordingly, this technique is well suited for updating library catalogs or for the maintenance or management of large databases that are fed from multiple sources.
MARC-XML has been shown to be the most complete format, because it has specific tags for any bibliographic data. RSS1 has also proved useful and versatile for the representation of bibliographic records, owing to the inclusion of various specialized description modules such as Dublin Core and PRISM. Although it lacks the areas of publication, physical description and series, it has a content area in which the full bibliographic record can be inserted, overcoming these deficiencies. In any case, this tag would help to display and retrieve any kind of bibliographic information, regardless of the format used. According to these criteria, RSS2 and ATOM are the least suitable syndication formats for the transfer of bibliographic data.
The analysis suggests that there are no appreciable differences in the creation time of the channel, whatever the chosen format and the complexity of its structure. However, we have found that the greater the complexity of the format, the greater the size of the channel, which implies an increase in the insertion/import time. Thus, although it is technically feasible to syndicate collections of any size, the insertion/import time is less than one min only for collections smaller than 25,000 records.

Further research directions:
Further research could analyze the syndication of bibliographic or library collections in real network environments, primarily to determine which factors most influence its performance. In such an environment it would be possible to compare its performance with that of the Z39.50 protocol.
Another direction for future research is the development of techniques for retrieving information from syndicated library collections through XQuery or XPath filtering, and their comparison with the usual MySQL techniques.
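As a rough indication of what XPath filtering over a syndicated collection could look like, the following Python sketch selects the titles of the items in a small RSS 2.0 channel that belong to a given category. It relies on the limited XPath subset supported by Python's standard `xml.etree.ElementTree`; the channel content and the category values are made up, and this is not a technique evaluated in the study.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 channel with three bibliographic items.
feed = """<rss version="2.0"><channel>
<title>Demo catalog</title>
<item><title>Library automation</title><category>Libraries</category></item>
<item><title>Medieval history</title><category>History</category></item>
<item><title>Cataloging rules</title><category>Libraries</category></item>
</channel></rss>"""

root = ET.fromstring(feed)

# XPath predicate: items having a <category> child whose text is "Libraries".
matches = root.findall("./channel/item[category='Libraries']/title")
titles = [title.text for title in matches]
```

The equivalent query on the imported data would be a `SELECT ... WHERE` statement on the corresponding MySQL table, which is the comparison suggested above.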