Fals-Ism: A Graph Isomorphism Framework for Multi-Level Detection of Falsified PDF Documents

Abstract: Fake Portable Document Format (PDF) documents are disseminated at a remarkable rate across social media. The negative consequences are obvious, but effective solutions for identifying falsified items in a PDF are still needed. Unlike work that looks for malicious scripts inserted into the file, this research aims at identifying falsified objects in different layers of the document. Specifically, we introduce Fals-Ism, a novel approach to detecting falsified PDF documents based on graph isomorphism. Each document is characterized by its metadata, structure, and content, which are used to build the corresponding graph so that any alteration is reflected in the complete graph. The graph is given as input to an isomorphism search algorithm, namely VF2, to verify whether there is a similarity-based isomorphism. Experiments are conducted on 36 PDF documents considering metadata, structure, and content modifications. The results show that Fals-Ism (i) is efficient at detecting forgery at the metadata, structure, and content levels; (ii) is robust and resistant to forgery attacks such as insertion, deletion, and modification of information; and (iii) does not require prior information about the PDF documents to perform the detection. Fals-Ism can detect different types of falsification in PDF (version 1.7 or higher) with an accuracy of 90%. A comparison with similar work confirms that Fals-Ism could be a complementary tool for fake news detection.


Introduction
Portable Document Format (PDF) has become the most popular and widely used format for describing digital documents worldwide (Bradley, 2011). In recent years, social networks have allowed malicious people to distribute falsified documents, making fraud on social networks an economic issue. Substantial investments are being made to fight it. People are looking to equip themselves with powerful fraud detection tools, regardless of their field of activity (Shu et al., 2017; Yang et al., 2019; Elhadad et al., 2019; Tchakounté et al., 2020a; Kaliyar et al., 2021). According to the research firm PricewaterhouseCoopers (PwC), 47% of companies surveyed globally in 2020 said they had been victims of social media fraud in the previous 24 months (Rivera et al., 2020), compared to 46% in 2022 (PWC, 2022). These statistics reveal that despite sophisticated fraud detection tools, the concern remains.
Today, physical information is digitized to facilitate its manipulation with technology. Organizations need to archive information for easy storage and future processing. For this, digital transformation relies on digitization into PDF documents, which can then be distributed through communication channels. The critical issue in these processes is guaranteeing the integrity of the document, since digital files are easier to falsify than paper documents. This aspect should be carefully considered when designing and choosing digital file formats (Bradley, 2011). Among digital file formats, PDF is one of the most secure when it comes to falsification (Bradley, 2011; Laptev et al., 2017). This is why this format is widely used to store information.
Current approaches to PDF integrity primarily look for malware infiltration and document tampering. Regarding malware infiltration, authors rely on machine learning and deep learning techniques to profile normal and abnormal PDFs using images and other features (Gebhardt et al., 2013; Beaugnon and Husson, 2017; Khan et al., 2018; Ayinala and Grandhi, 2021). Regarding document forgery, Khanna et al. (2008), Elkasrawi and Shafait (2014), and Abed (2015) attempt to distinguish the scanned and the printed version of a PDF document using texture analysis. With cryptography, a counterfeit is identified by mechanisms that make it difficult to evade signatures bound to the document (Perry et al., 2000; Picard et al., 2004; Cheddad et al., 2008; Schouten and Jacobs, 2009; Ibrahim et al., 2010; Yang, 2014; Tchakounté et al., 2020b; Tchakounte et al., 2021). Other research attempts recognize falsification with artificial intelligence based on the extraction of image features (Van Beusekom et al., 2013; Patgar and Vasudev, 2013; Bertrand et al., 2013; 2015; Patgar et al., 2014; Abramova, 2016; Laptev et al., 2017). While all of these works have potential, there are still shortcomings to overcome. Detecting inserted malware requires in-depth dynamic analysis while the document is being manipulated, in order to study the different flows and activities performed. Given the heterogeneity of the software and of PDF documents, this is hardly feasible. Moreover, a simple variation in the content of the document modifies its signatures and profiles. The document will therefore be wrongly classified as false.
In this study, under the assumption that the original PDF document is available, we propose a similarity matching approach to recognize falsified PDFs. The proposed method, namely Fals-Ism, relies on a robust and well-proven theoretical tool for solving similarity assessment problems: graph isomorphism. A document is profiled at three levels, i.e., metadata, structure, and content. Based on these profiles, we convert the target document into a graph that we compare against the graph of the original document, profiled in the same way. With graph isomorphism matching principles, we determine exactly where the alterations were applied in the falsified PDF document. The manipulation of PDF documents requires a deep understanding of their anatomy. The following sections give a basic idea of how PDF files are constructed, structured, and secured. All the information is based on the PDF standard, ISO/IEC 32000-1:2008.

PDF Anatomy
A PDF supports eight basic data types. Each type corresponds to a specific set of values, described as follows. The Boolean type is represented by the keyword true or false. The number type covers integers and reals. The string type is either a sequence of characters between parentheses "()" or hexadecimal data between angle brackets "<>". The name type is a sequence of characters in which any character except the null character may appear; a name is introduced by the special character "/" (slash). The array type can hold multiple objects of different types, including names, strings, and other arrays. The dictionary type resembles a dictionary in which each entry is a key followed by its value; the value can be any object, including another dictionary. The stream type resembles a string in a programming language but can have unlimited length; streams are a special type for holding large data that a simple string cannot hold. Finally, the null object is an empty object represented by the keyword null.
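To make the eight types concrete, the following minimal sketch classifies an already isolated PDF token by its delimiters. It is an illustration only, not a PDF parser: the function name classify_pdf_token is hypothetical, tokens are assumed to be pre-split strings, and escape sequences, comments, and nested delimiters are ignored.

```python
import re

def classify_pdf_token(token: str) -> str:
    """Return the basic PDF type suggested by the delimiters of one token.

    Hypothetical helper for illustration; real PDF lexing is more involved.
    """
    token = token.strip()
    if token in ("true", "false"):
        return "boolean"
    if token == "null":
        return "null"
    # Integers and reals, including forms such as "4." and ".5".
    if re.fullmatch(r"[+-]?(\d+\.?\d*|\.\d+)", token):
        return "number"
    # A stream is a dictionary followed by stream ... endstream.
    if token.startswith("<<") and token.endswith("endstream"):
        return "stream"
    if token.startswith("<<") and token.endswith(">>"):
        return "dictionary"
    # Single angle brackets delimit a hexadecimal string.
    if token.startswith("<") and token.endswith(">"):
        return "string"
    # Parentheses delimit a literal string.
    if token.startswith("(") and token.endswith(")"):
        return "string"
    if token.startswith("/"):
        return "name"
    if token.startswith("[") and token.endswith("]"):
        return "array"
    raise ValueError(f"unrecognized token: {token!r}")

examples = [
    "true", "-3.14", "(Hello)", "<48656C6C6F>", "/Type",
    "[1 2 (a)]", "<< /Type /Page >>",
    "<< /Length 5 >>\nstream\nHello\nendstream", "null",
]
for example in examples:
    print(classify_pdf_token(example), "<-", example.splitlines()[0])
```

The order of the tests matters: the double angle bracket of a dictionary has to be checked before the single angle bracket of a hexadecimal string.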
The structure of PDF files determines (i) how objects are stored, (ii) how they are accessed, and (iii) how they are updated in a PDF document. This structure is independent of the semantics of the objects. All PDF files have a common structure, which is subdivided into four parts: the header, body, cross-reference table, and trailer. The header is the first line of the PDF source. It consists of the five characters "%PDF-" followed by the version number. The body holds the objects that make up the actual content of the PDF document. The cross-reference table is the most important part of the file structure because it records where each object is located in the file, which allows direct access to any object. The trailer, the final part of the file, is used to find the cross-reference table and several useful objects in the file. The table presents the different keys, their values, and types that can be encountered as trailer entries.
The document structure of a PDF specifies how the basic object types are used to represent its components (pages, fonts, annotations, etc.). A PDF file's document structure consists of a number of objects arranged hierarchically in the body. At the root is the document catalog, which includes references to other objects that describe the document's content and give instructions on how that content should be displayed. The catalog refers to a number of objects, but the one that matters most to us is the page tree, a structure that arranges all page objects and makes them accessible. For an in-depth discussion of PDF documents, we refer the reader to Adobe Systems (2008).
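The four-part layout can be illustrated with a short sketch that builds a tiny PDF-like byte string and then locates its header version, cross-reference table, and trailer. The function names build_sample_pdf and describe_pdf_layout are hypothetical, the sample has no pages, and the sketch assumes a classic uncompressed cross-reference table without incremental updates.

```python
import re

def build_sample_pdf() -> bytes:
    """Assemble a minimal header/body/xref/trailer byte string (illustrative only)."""
    header = b"%PDF-1.7\n"
    obj1 = b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    obj2 = b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
    off1 = len(header)              # byte offset of object 1
    off2 = off1 + len(obj1)         # byte offset of object 2
    xref_offset = off2 + len(obj2)  # byte offset of the xref keyword
    xref = (b"xref\n0 3\n0000000000 65535 f \n"
            + b"%010d 00000 n \n" % off1
            + b"%010d 00000 n \n" % off2)
    trailer = (b"trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
               % xref_offset)
    return header + obj1 + obj2 + xref + trailer

def describe_pdf_layout(data: bytes) -> dict:
    """Locate the structural parts of a PDF byte string."""
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    if header is None:
        raise ValueError("missing %PDF- header")
    # The last startxref keyword is followed by the offset of the xref table.
    start = data.rfind(b"startxref")
    if start == -1:
        raise ValueError("missing startxref keyword")
    offset = re.search(rb"startxref\s+(\d+)", data[start:])
    if offset is None:
        raise ValueError("startxref is not followed by an offset")
    return {
        "version": header.group(1).decode("ascii"),
        "xref_offset": int(offset.group(1)),
        "has_trailer": data.rfind(b"trailer") != -1,
        "ends_with_eof": data.rstrip().endswith(b"%%EOF"),
    }

sample = build_sample_pdf()
layout = describe_pdf_layout(sample)
print(layout)
```

Reading the file from the end, as the trailer lookup does here, mirrors how a reader finds the cross-reference table before touching any object in the body.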

Graph Isomorphism
A graph G is a pair $G = (S(G), A(G))$, where the elements of $S(G)$ are the vertices or nodes and $A(G)$ is the set of edges, with $A(G) \subseteq S(G) \times S(G)$. An application $f: S(G) \to S(H)$ is a graph morphism if the image of any edge of a graph G is an edge of a graph H. Mathematically, this is given as

$$(u, v) \in A(G) \implies (f(u), f(v)) \in A(H);$$

G and H are then homomorphic if there exists a morphism between them. The application f realizes an isomorphism when f is bijective and both f and its inverse are morphisms, i.e., there exists a one-to-one relation $f: S(G) \to S(H)$ such that

$$(u, v) \in A(G) \iff (f(u), f(v)) \in A(H),$$

which means $G \cong H$.
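As a small illustration of these definitions, the sketch below checks whether a given mapping f is a morphism and whether it is an isomorphism. It assumes directed graphs stored as a list of nodes and a set of ordered pairs; the function names are hypothetical and the check verifies a mapping that is already given rather than searching for one.

```python
def is_morphism(f, edges_g, edges_h):
    """True if the image of every edge of G is an edge of H."""
    return all((f[u], f[v]) in edges_h for (u, v) in edges_g)

def is_isomorphism(f, nodes_g, nodes_h, edges_g, edges_h):
    """True if f is a bijection S(G) -> S(H) preserving edges in both directions."""
    images = set(f.values())
    bijective = (set(f) == set(nodes_g)
                 and images == set(nodes_h)
                 and len(images) == len(f))
    if not bijective:
        return False
    inverse = {v: u for u, v in f.items()}
    return is_morphism(f, edges_g, edges_h) and is_morphism(inverse, edges_h, edges_g)

# G: a -> b -> c and H: 1 -> 2 -> 3
nodes_g, edges_g = ["a", "b", "c"], {("a", "b"), ("b", "c")}
nodes_h, edges_h = [1, 2, 3], {(1, 2), (2, 3)}

good = {"a": 1, "b": 2, "c": 3}
bad = {"a": 1, "b": 3, "c": 2}   # bijective, but (a, b) maps to (1, 3), not an edge of H

print(is_isomorphism(good, nodes_g, nodes_h, edges_g, edges_h))
print(is_isomorphism(bad, nodes_g, nodes_h, edges_g, edges_h))
```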

Materials and Methods
The proposed approach is divided into four modules. The first module extracts the features of a PDF document; the second transforms the features into graphs; the third searches for an isomorphism between the graphs; and the last module makes the decision.

Mathematical Formulation
We are interested in graph isomorphism to detect a forgery in a PDF file. We consider the text in the PDF document as a graph whose nodes are words and whose edges are the semantic relationships in the document. Since a PDF document can be decomposed into a graph, an isomorphism between two PDF documents confirms that both PDF documents are equal. Let G be a graph representing a PDF file. G can be decomposed into a set of n subgraphs $G_1, G_2, \dots, G_n$ such that there exists a strict structural relationship $\Re$ between the subgraphs, i.e., if G is a PDF file then

$$G = \Re(G_1, G_2, \dots, G_n),$$

where all the $G_i$ are subgraphs of G and the notation $\Re(G_1, G_2, \dots, G_n)$ means there are strict structural relationships between two successive subgraphs $G_i$ and $G_{i+1}$. Two PDF files are identical if and only if there exists an isomorphism between their subgraphs. This is written mathematically as follows. Let G and H be two PDF files with subgraphs $G_1, G_2, \dots, G_n$ and $H_1, H_2, \dots, H_m$ respectively; G and H are identical if $n = m$ and for all $i = 1, 2, \dots, n$ there exists a bijective application $f: S(G_i) \to S(H_i)$ such that

$$(u, v) \in A(G_i) \iff (f(u), f(v)) \in A(H_i).$$

The problem that arises is how to find f. There are two types of algorithms for finding isomorphisms between graphs. The first is exact matching, which includes algorithms that look for a perfect match between two graphs before considering them isomorphic. The second is inexact or fault-tolerant matching, which includes algorithms that relax the constraints and allow some errors and noise when searching for an isomorphism between two graphs. To tolerate redundant edges and vertices, they only require that each vertex of the first graph can be mapped to a distinct vertex of the second graph, regardless of the edge orientation between the vertices (Wang et al., 2018).
The algorithm used in this study for finding an isomorphism between two PDF files is VF2 (Cordella et al., 2001). Regardless of the size and type of the graphs to be matched, the exact matching method VF2 has consistently been able to solve the isomorphism problem (Cordella et al., 2004). The VF2 algorithm was proposed by Cordella et al. (2001) for matching large graphs. It is a heuristic method with features inherited from the VF algorithm (Cordella et al., 1998; 1999), and it reduces the memory requirement of VF from $O(n^2)$ to $O(n)$, where n is the number of vertices of the graph. VF2 checks for an isomorphism as follows. For two graphs G and H, an initial state S0 is set up to initialize all data structures. The matching procedure then takes the current state and checks whether a candidate pair of vertices, one from G and one from H, matches. If the verification returns true, the candidate pair is added to the set of matched pairs and the procedure is repeated until all candidate pairs have been tested. If a candidate pair fails, the procedure returns to the last matched pair and tries another candidate; if no complete mapping is found, no match is returned.
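The word-graph idea can be sketched as follows, under a loudly stated assumption: the "semantic relationship" between words is approximated here by simple adjacency (an edge joins each word to the word that follows it), which is not necessarily the relation used by Fals-Ism. Because every node carries its word as a label, a label-preserving isomorphism between two such graphs reduces to equality of their node and edge sets, and the set difference shows where an alteration occurred. All function names are hypothetical.

```python
def text_to_graph(text):
    """Build (nodes, edges) from text; edges link consecutive words (assumption)."""
    words = text.split()
    return frozenset(words), frozenset(zip(words, words[1:]))

def identical_documents(subgraphs_g, subgraphs_h):
    """n = m and every pair (G_i, H_i) matches, following the formulation above."""
    return (len(subgraphs_g) == len(subgraphs_h)
            and all(g == h for g, h in zip(subgraphs_g, subgraphs_h)))

def altered_edges(graph_g, graph_h):
    """Edges present in only one of the two graphs, to localize a change."""
    return {
        "removed": sorted(graph_g[1] - graph_h[1]),
        "added": sorted(graph_h[1] - graph_g[1]),
    }

original = [text_to_graph("pay 100 euros to Alice"), text_to_graph("signed by Bob")]
forged = [text_to_graph("pay 900 euros to Alice"), text_to_graph("signed by Bob")]

print(identical_documents(original, forged))
print(altered_edges(original[0], forged[0]))
```

A limitation of this simplification is that repeated words collapse into a single node, so some reorderings of a text with many repeated words may produce the same graph.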
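The state-based search described above can be sketched with a simplified backtracking matcher. This is not the VF2 algorithm of Cordella et al.: it keeps only the core idea of extending a partial mapping one candidate pair at a time and backtracking on failure, and it omits the look-ahead feasibility rules and data structures that give VF2 its efficiency. The function name find_isomorphism is hypothetical and graphs are assumed to be directed, given as a node list and a set of ordered pairs.

```python
def find_isomorphism(nodes_g, edges_g, nodes_h, edges_h):
    """Return a node mapping G -> H that is an isomorphism, or None."""
    nodes_g, nodes_h = list(nodes_g), list(nodes_h)
    edges_g, edges_h = set(edges_g), set(edges_h)
    if len(nodes_g) != len(nodes_h) or len(edges_g) != len(edges_h):
        return None

    def feasible(mapping, u, v):
        # Edges between (u, v) and every already matched pair must agree
        # in both directions, and self-loops must agree as well.
        for u2, v2 in mapping.items():
            if ((u, u2) in edges_g) != ((v, v2) in edges_h):
                return False
            if ((u2, u) in edges_g) != ((v2, v) in edges_h):
                return False
        return ((u, u) in edges_g) == ((v, v) in edges_h)

    def extend(mapping):
        if len(mapping) == len(nodes_g):
            return dict(mapping)          # complete mapping found
        u = nodes_g[len(mapping)]         # next unmatched node of G
        used = set(mapping.values())
        for v in nodes_h:
            if v not in used and feasible(mapping, u, v):
                mapping[u] = v
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[u]            # backtrack to the last matched pair
        return None

    return extend({})                     # the empty mapping plays the role of S0

cycle = (["a", "b", "c"], {("a", "b"), ("b", "c"), ("c", "a")})
other_cycle = ([1, 2, 3], {(2, 3), (3, 1), (1, 2)})
dag = ([1, 2, 3], {(1, 2), (2, 3), (1, 3)})

print(find_isomorphism(*cycle, *other_cycle))
print(find_isomorphism(*cycle, *dag))
```

In practice one would likely rely on an existing VF2 implementation rather than this sketch, which in the worst case explores a factorial number of mappings.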

Algorithms and Flowchart of the Proposed Method
We propose an approach that includes three algorithms. Algorithm 1 is the feature extraction level, which is responsible for extracting the features of a PDF document. The input of this module is a PDF document and the output is the metadata, structure, and content of the PDF. To extract the different features that make up a PDF file, the pdf reader library is used. This library gives access to all the features in a PDF file. The outputs of Algorithm 1 are the inputs of Algorithm 2, which is responsible for transforming them into graphs. Algorithm 3 is the module where the isomorphism is checked. The inputs of this module are the graphs and the output is a Boolean indicating whether the input graphs realize an isomorphism or not.
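The chaining of the three algorithms can be sketched as below. Everything in this block is an assumption made for illustration: a document is represented by an already extracted dictionary with the keys metadata, structure, and content (the real extraction from a PDF file is not modelled), the graph shapes are invented, and the check relies on labelled graphs being equal rather than on a full isomorphism search.

```python
LEVELS = ("metadata", "structure", "content")

def extract_features(document):
    """Stand-in for Algorithm 1: return the three feature levels of a document."""
    return {level: document[level] for level in LEVELS}

def features_to_graphs(features):
    """Stand-in for Algorithm 2: one labelled graph (nodes, edges) per level."""
    graphs = {}
    for level in ("metadata", "structure"):
        # Star graph: the level name is the root, each key=value pair is a leaf.
        items = [f"{key}={value}" for key, value in sorted(features[level].items())]
        graphs[level] = (frozenset([level, *items]),
                         frozenset((level, item) for item in items))
    words = features["content"].split()
    graphs["content"] = (frozenset(words), frozenset(zip(words, words[1:])))
    return graphs

def check_document(original, suspect):
    """Stand-in for Algorithm 3: Boolean decision plus the per-level outcome."""
    graphs_o = features_to_graphs(extract_features(original))
    graphs_s = features_to_graphs(extract_features(suspect))
    per_level = {level: graphs_o[level] == graphs_s[level] for level in LEVELS}
    return all(per_level.values()), per_level

original_doc = {
    "metadata": {"Author": "Alice", "Producer": "TeX"},
    "structure": {"Font": "Helvetica", "Pages": 1},
    "content": "pay 100 euros to Bob",
}
forged_doc = dict(original_doc, metadata={"Author": "Mallory", "Producer": "TeX"})

print(check_document(original_doc, forged_doc))
```

Returning the per-level outcome alongside the Boolean keeps the information about which level was altered, which a single Boolean would lose.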

Experimental Validation
This section evaluates Fals-Ism with several tests. Ten experiments are carried out using different sample datasets. The tests follow three types of falsification attack: (i) insertion of information, which consists of adding information to the content of the PDF document; (ii) alteration or deletion of some information that is part of the content of the PDF document; and (iii) modification of information, which consists of changing an element in the document.

Datasets
We simulate 36 samples of PDF documents and split them into 10 sample datasets. Because in real life forgery can occur anywhere in a PDF document, we have chosen several variants of PDF documents to cover many of these aspects, as follows:
1. Twenty of the PDF documents contain only raw text
2. One of the PDF documents contains text and images
3. One of the PDF documents is generated by scanning the original physical document as an image without applying Optical Character Recognition (OCR)
4. One of the PDF documents is generated by scanning the original physical document as an image with OCR
5. One of the PDF documents is generated by scanning a physical document as an image with OCR and still contains images
6. One of the PDF documents contains text and tables
7. One of the PDF documents contains text, mathematical equations, and symbols
8. One of the PDF documents contains a signature
9. Seven of the PDF documents are from PDF versions 1.0-1.6
10. Two of the PDF documents are of versions greater than or equal to 1.7

Experiments
On each sample dataset, tests are performed at three levels (validation stages) before a decision is made on the authenticity of the document. The first validation step deals with the metadata, the second with the structure, and the third with the content. At each of these steps, different types of falsification, namely insertion, deletion, and modification of information, are introduced into each PDF document, which is then submitted to Fals-Ism for detection. Figure 2 shows the detection pattern obtained by Fals-Ism on the content of all 10 sample datasets. The x-axis represents the 10 sample datasets used to test Fals-Ism and the y-axis represents the capability of the system. The capability is expressed in three bits, meaning that there are $2^3 = 8$ possibilities, which are interpreted in Table 1. For example, the value 7 refers to 111 in binary. This means that the system was able to detect falsification by insertion, modification, and deletion of information in the content of the PDF document used in this experiment. The value 5 refers to 101 in binary. It means that the system was able to detect insertion and deletion but not modification of information. The results in Fig. 2 show that in experiments 1, 4, 6, 7, 9, and 10 the value is 111, which means that falsification by insertion, deletion, and modification was detected. The curve decreases slightly in experiment 2 and drops to zero in experiment 3. The same is observed in experiments 5 and 8, respectively.
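The three-bit capability encoding can be sketched as follows. The bit order is an assumption: the default used here (insertion, deletion, modification, from the most significant bit) is the one suggested by the rows of Table 1, while the prose example for the value 5 is consistent with the order insertion, modification, deletion, so the order is left as a parameter. The names decode_capability and encode_capability are hypothetical.

```python
TABLE_ORDER = ("insertion", "deletion", "modification")   # assumed from Table 1
PROSE_ORDER = ("insertion", "modification", "deletion")   # assumed from the text

def decode_capability(value, order=TABLE_ORDER):
    """Map a capability value in 0..7 to one Boolean per falsification type."""
    if not 0 <= value <= 7:
        raise ValueError("capability must fit in three bits")
    bits = format(value, "03b")            # e.g. 5 -> "101"
    return {name: bit == "1" for name, bit in zip(order, bits)}

def encode_capability(flags, order=TABLE_ORDER):
    """Inverse of decode_capability: pack three Booleans into a value in 0..7."""
    return int("".join("1" if flags[name] else "0" for name in order), 2)

print(decode_capability(7))
print(decode_capability(5))
print(decode_capability(5, order=PROSE_ORDER))
```

With the first order, 5 reads as insertion and modification detected; with the second, it reads as insertion and deletion detected, which is why the order has to be fixed before the curves are interpreted.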
Unlike falsification of the metadata or the content of a PDF document, the only type of falsification that applies to the structure of a document is modification (for example, one can modify the font of the characters in a PDF document); there is no deletion or insertion. This is why the curve for the structure level carries only one bit per experiment, where 1 means that the modification was detected and 0 otherwise.
In Fig. 3, the curve is constant at 1 in experiments 1, 2, 4, 5, 6, 7, 8, and 10. This means that the system was able to detect all the modifications made to the structure of these PDF documents. However, in experiments 3 and 9 the bit is 0, which means that the system was unable to detect the falsification carried out on the structure of these PDF documents.

Table 1: Capabilities interpretations
Capabilities | Interpretation
1 0 0 | The system can detect insertion but is not able to detect deletion and modification
0 1 0 | The system is not able to detect insertion and modification but can detect deletion
0 0 1 | The system is not able to detect insertion and deletion but can detect modification
1 1 0 | The system can detect insertion and deletion but cannot detect modification
1 0 1 | The system can detect insertion and modification but cannot detect deletion
0 1 1 | The system can detect deletion and modification but cannot detect insertion
1 1 1 | The system is able to detect insertion, deletion, and modification

In experiments 1, 2, 4, 5, 6, 7, 8, 9, and 10 in Fig. 4, all the points correspond to 7, i.e., 111 in binary. This means that the system was able to detect all insertions, deletions, and modifications of information in the metadata of these PDF documents. The exception is experiment 3, where the system was unable to detect any type of falsification performed on the PDF document.

Experiments | Step 1 | Step 2 | Step 3
1 | The system has detected deletion, modification, and insertion of the text in the content of the document | The system has detected the modification made on the structure of this PDF document | The system has detected the insertion, modification, and deletion of information made on the metadata of this PDF document
2 | The system has detected the deletion and insertion of an image in the content of the document but cannot detect when a modification is made to the structure of the image or if the image has been replaced by another | The system has detected the modification made on the structure of this PDF document | The system has detected the insertion, modification, and deletion of information made on the metadata of this PDF document
3 | The system was not able to detect falsifications that were made on the content of this PDF document | The system was not able to detect falsifications that were made on the structure of this PDF document | The system was not able to detect falsifications that were made on the metadata of this PDF document
4 | The system detected the insertion, deletion, or modification made on the content of the PDF document | The system has detected the modification made on the structure of this PDF document | The system detected the insertion, modification, and deletion of information made on the metadata of this PDF document
5 | The system has detected the deletion or insertion of an image in the content of the document but cannot detect when a modification is made to the structure of the image or if the image in question has been replaced by another | The system has detected the modification made on the structure of this PDF document | The system has detected the insertion, modification, and deletion of information made on the metadata of this PDF document
6 | The system detected the insertion, deletion, and modification of a table in a PDF document | The system detected the modification made on the structure of this PDF document | The system detected the insertion, deletion, and modification at the metadata level of this PDF document
7 | The system detected the insertion, deletion, or modification made on equations and symbols in the PDF document | The system detected the modification made on the structure of this PDF document | The system detected the insertion, deletion, or modification made on the metadata of this PDF document
8 | The system did not detect the insertion, deletion, and modification of the signature on this PDF document | The system has detected a modification made on the structure of this PDF document | The system detected an insertion, deletion, and modification of information made on the metadata of this PDF document
9 | The system detected the insertion, deletion, and modification of information at the content level of this PDF document | The system did not detect the modifications made on the structure of PDF with version less than 1.7 | The system detected the insertion, deletion, and modification made on the metadata of this PDF document
10 | The system detected the falsifications of types insertion, modification, and deletion made at the level of the contents of these PDF documents and the falsifications of type modification made to their structure | The system detected falsifications of types insertion, modification, and deletion at the level of the metadata of the PDF documents | The system detected falsifications of types insertion, modification, and deletion at the level of metadata of the PDF documents
Table 2 shows the test results for each of the three steps and for all the experiments we carried out with the 10 sample datasets. The first column corresponds to the sample dataset. The three other columns refer to the level of the experiments: content, structure, and metadata. At each level, we apply the three types of falsification: insertion, deletion, and modification.

Results and Discussions
Figures 1-4 show that Fals-Ism has the following capabilities:
- Fals-Ism is an effective tool for detecting falsification at the content level in PDF documents consisting only of raw text. The system is able to detect even the exchange of two characters in the document
- The system is able to detect any manipulation performed by inserting, deleting, or modifying information
- The results obtained show that the proposed method is effective in detecting falsification at the metadata level, with a detection rate of 90%
Fals-Ism also has some weaknesses:
- We observe that the performance could decrease with the number of pages. Experiments with more complex and very large PDF files are required
- Detection at the document structure level remains uncertain except for PDFs with versions greater than or equal to 1.7. Other versions should be investigated in depth
- The system is completely unable to perform forgery detection on PDF documents scanned as an image or generated directly as an image
- The system is completely unable to detect a forgery made by, for example, changing a signature in a document
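The 90% detection rate at the metadata level follows from the outcomes described for Figs. 2-4: detection failed at the metadata level only in experiment 3, i.e., 9 of 10 experiments succeeded. The short sketch below repeats the same arithmetic for the other two levels. Only the metadata figure is stated in the text; the content and structure rates are computed here from the described outcomes, and counting a partially detected experiment (2 and 5 at the content level) as a failure is an assumption.

```python
EXPERIMENTS = range(1, 11)

# Experiments in which not every applied falsification was detected,
# as described for Fig. 2 (content), Fig. 3 (structure), and Fig. 4 (metadata).
NOT_FULLY_DETECTED = {
    "content": {2, 3, 5, 8},
    "structure": {3, 9},
    "metadata": {3},
}

def detection_rate(level):
    """Percentage of experiments with full detection at the given level."""
    detected = [e for e in EXPERIMENTS if e not in NOT_FULLY_DETECTED[level]]
    return 100.0 * len(detected) / len(EXPERIMENTS)

for level in ("content", "structure", "metadata"):
    print(level, detection_rate(level))
```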

Comparison with Similar Work
Han et al. (2021) suggested a Decentralized Document Management System (DDMS) to increase the security of digital documents. For greater document security, DDMS distributes access rights among a number of users by symmetrically encrypting the document with a key and dividing the key using Shamir's secret sharing. Each key share is managed using blockchain, and when the document is retrieved through a dedicated smart contract, the whole symmetric key is rebuilt. The proposed DDMS can provide stronger security with a fair performance overhead. Serranito et al. (2020) use smart contracts and blockchain technology to build a decentralized verification solution for university diplomas and other higher education qualifications. A prototype of the implementation is tested on an actual blockchain, and the challenges faced are identified and assessed, with an emphasis on how well the decentralization mechanism worked. According to the authors, the technology enables higher education institutions to store the certificates they issue in the blockchain, where hiring companies may verify their integrity and legitimacy. Laptev et al. (2017) suggested working directly on the PDF file's source code to make the process of identifying modifications in PDF files easier. Their results show that the method is efficient for analyzing PDF files to determine their integrity.
GraDID is proposed by Jung et al. (2022) to determine whether a document has been substituted for another. The authors studied the consistency of the body context of a document formalized as a graph in which each node is taken as a whole text. Unlike our approach, this representation is coarse-grained and imprecise. Moreover, that study is limited when the structure and metadata are altered. Kada et al. (2022) proposed a way to identify fake identity documents based on the reconstruction of holograms. Unlike ours, this proposal considers general holograms and therefore lacks precision. Patil et al. (2022) relied on sequencing the order of pixels to discover irregularities in handwritten documents. This type of document is not within the scope of this research.
Methods based on blockchain technology offer a very good level of security, but implementing this technology is very complex and requires substantial resources in terms of computing power, storage space, etc. All these parameters imply a huge financial cost, which is why the solution in Ali and Bhaya (2021) has not yet been deployed for real-life use. The high cost of implementing these kinds of solutions puts them out of reach of small institutions that cannot afford such systems, yet also need to secure themselves. This is where Fals-Ism shows its potential: it is simple, easy to implement, and does not require a lot of computing resources for deployment. Gunawan et al. (2021) exploited blockchain technology to verify the authenticity of academic degrees and certificates. However, their approach cannot localize where the falsifications appear, which would serve as proof.
We implement and analyze the method in Laptev et al. (2017) and compare the results to Fals-Ism. Note that the results in Laptev et al. (2017) are discussed using a dataset of 9 PDF documents, each of which has only one page. Investigating the performance of their method on our datasets provides a more robust analysis than the nine PDF documents initially tested in Laptev et al. (2017). We observe the following:
- The method of Laptev et al. (2017) is effective in detecting traces of manipulation of an image in a document, unlike Fals-Ism, which only detects whether the image has been deleted
- Another strength of Laptev et al. (2017) is that it can determine the specific software that was used to edit the PDF document
- Laptev et al. (2017) require some information about the PDF beforehand, unlike Fals-Ism
- Laptev et al. (2017) cover only forgeries made with the most common editing software, such as Adobe Photoshop or PDF Creator. Fals-Ism operates independently of the PDF editing software

Conclusion and Perspectives
The aim of this study was to propose a system to detect falsified PDF documents based on graph isomorphism. A general study was carried out to understand the structure of PDF documents by analyzing their anatomy. Feature extraction allowed us to obtain parameters such as the structure, content, and metadata of the PDF document. We then translated the extracted features into graphs, and an isomorphism check was performed on each graph in order to verify whether the PDF is falsified. Experiments were carried out on a dataset of 36 PDF samples. The proposed method was compared with similar work in the literature, and the results show that it is effective and resistant against attacks involving the insertion, deletion, and modification of information in the content, structure, and metadata of PDF documents.
Fals-Ism operates at three levels to look for falsified items. This solution is costly in terms of execution time. Future work will focus on carefully optimizing its complexity. This aim could be achieved by relying on machine learning automation.

Table 2: Results obtained on the whole samples (columns: Experiments, Step 1, Step 2, Step 3)