Measuring Test Data Uniformity in Acceptance Tests for the FitNesse and Gherkin Notations

Corresponding Author: Douglas Hiura Longo, Department of Informatics and Statistics, Federal University of Santa Catarina, Florianópolis, Brazil
Email: douglashiura@gmail.com

Abstract: This paper presents two metrics designed to measure the data uniformity of acceptance tests in FitNesse and Gherkin notations. The objective is to measure the data uniformity of acceptance tests in order to identify projects with lots of random and meaningless data. Random data in acceptance tests hinder communication between stakeholders and increase the volume of glue code. The main contribution of this paper is the implementation of the proposed metrics. This paper also evaluates the uniformity of test data from several FitNesse and Gherkin projects found on GitHub, as a means to verify if the metrics are applicable. First, the metrics were applied to 18 FitNesse project repositories and 18 Gherkin project repositories. The measurements taken from these repositories were used to present cases of irregular and uniform test data. Then, we have compared the notations from FitNesse and Gherkin in terms of projects and features. In terms of projects, no significant difference was observed, that is, FitNesse projects have a level of uniformity similar to Gherkin projects. However, in terms of features and test documents, there was a significant difference. The uniformity scores of FitNesse and Gherkin features are 0.16 and 0.26, respectively. These uniformity scores are very low, which means that test data for both notations are very irregular. Thus, we can infer that test data are more irregular in FitNesse features than in Gherkin features. The evaluation also shows that 28 of 36 projects (78%) did not reach the minimum recommended measure, i.e., 0.45 of test data uniformity. In general, we can observe that there are still many challenges in improving the quality of acceptance tests, especially in relation to the uniformity of test data.


Introduction
Analogous to Test-Driven Development (TDD) (Beck, 2003), Acceptance Test-Driven Development (ATDD) includes different stakeholders (client, developer, tester) who collaborate to write acceptance tests before implementing system functionality (Gärtner, 2012). Teams involved with ATDD generally find that merely defining acceptance tests and discussing test specifications leads to a better understanding of the requirements. This happens because acceptance tests tend to force a solid agreement on the exact behavior that is expected from the software (Hendrickson, 2008). According to Santos, (Longo and Vilain, 2018), there are 21 techniques used to specify acceptance tests. Acceptance test specifications can be written using semi-structured formats, tables, diagrams, or other Domain-Specific Languages (DSLs).
It is estimated that 85% of software defects originate from ambiguous, incomplete and illusory requirements (Torchiano et al., 2007). Specifying software requirements using acceptance tests is an attempt to improve the quality of requirements. However, several problems can arise from the specification of requirements using acceptance tests, as also happens with the specification of requirements using natural language. For example, by using natural language, readers and writers may use the same word to name different concepts, or even express the same concept in completely different ways (Sommerville, 2011).
Most notations of acceptance tests are composed of functional data and test data (Druk and Kropp, 2013). Functional data is an artifact that is used to connect test data to the System Under Test (SUT). The connection is done through glue code, which must follow the template of the framework that is being used to execute the tests. Test data are used to set up the SUT and to state the output data expected from the SUT. Test data are represented by words or expressions, usually with a flag to differentiate them from functional data. Longo and Vilain (2018) and Longo et al. (2019) define test data as either uniform or irregular. Uniform test data are expressions that are common to various test documents, whereas irregular test data are single expressions that are not repeated across test documents. Figure 1 shows an example of an acceptance test in Gherkin notation with uniform and irregular test data. This acceptance test deals with the login feature and has two scenarios. The uniformity and irregularity of test data can be verified by comparing the test data from the two scenarios. For example, the test data value 'SOFTWARETETINGHELP.COM' appears in both scenarios and is therefore considered uniform. In contrast, the test data values 'Mary' and 'John', and 'PASSWORD' and 'PASSWORD1', are considered irregular because they are not repeated in both scenarios.
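As a minimal sketch of this distinction (not the authors' implementation), the following Python snippet classifies the test data of the two login scenarios. The function name `classify_test_data`, the list-based representation of a scenario and the pairing of each password with a username are assumptions introduced for illustration; only the data values themselves come from the text.

```python
def classify_test_data(first, second):
    """Split the test data of two scenarios into uniform and irregular values."""
    shared = set(first) & set(second)
    uniform = sorted(shared)                                # present in both scenarios
    irregular = sorted((set(first) | set(second)) - shared)  # present in only one
    return uniform, irregular


# Hypothetical grouping of the values mentioned in the text.
scenario_1 = ["SOFTWARETETINGHELP.COM", "Mary", "PASSWORD"]
scenario_2 = ["SOFTWARETETINGHELP.COM", "John", "PASSWORD1"]

uniform, irregular = classify_test_data(scenario_1, scenario_2)
print("uniform:", uniform)
print("irregular:", irregular)
```

Under these assumptions, the site name is the only uniform value, while both usernames and both passwords are irregular.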
In general, several features are specified for the development of an application in which acceptance tests are included. These features are usually organized into separate documents; they can be specified at various times throughout the development process and by different people. In this way, maintaining the uniformity of test data can be a challenge, because a lot of test data may be expressed in completely different ways but with the same meaning. For those involved in specifying a test, communication through tests with irregular data may be feasible, with little or no information loss, since humans are able to interpret the irregularities of test data and understand the meaning of the test. However, there may be problems associated with unintentionally irregular data and glue code reuse for test automation (Longo et al., 2019). For example, in order to automate the scenarios of the login feature (Fig. 1), some settings of the SUT are required. It is necessary to set up 'Mary' for the first scenario and 'John' for the second scenario. The setup could be more easily reused if the test data were uniform. According to Greiler et al. (2013a), test code duplication should be avoided through code reuse. Code reuse can facilitate maintenance activities, as a smaller volume of code is easier to handle. Figure 2 shows an improvement in the uniformity of the test data presented in Fig. 1. The improvement refers to the uniformization of the usernames 'Mary' and 'John'. In the example in Fig. 1, the test can still be understood and performed with non-uniform test data. More uniform test data, however, can reduce the fixture settings in the glue code. Figure 3 shows the glue code for the examples in Fig. 1 and 2. In the glue code example, lines 6 and 7 are highlighted because they can be discarded for the example with the most uniform data (Fig. 2).

Fig. 1: Sample of acceptance test in Gherkin with uniform and irregular test data (adaptation from Softwaretestinghelp, 2020)
Fig. 3: Glue code for the examples in Fig. 1 and 2 with highlighting of lines neglected by the better data uniformity (adaptation from Softwaretestinghelp, 2020)

This lower volume of glue code achieved by more uniform test data means less development and maintenance effort. Some gain in communication can also be obtained: if we compare the test examples in Fig. 1 and Fig. 2, we can observe that the more uniform test data clarify the meaning of the scenarios in Fig. 2, namely the first scenario with a successful authentication in the system and the second with a failure caused by an incorrect password. In the test example in Fig. 1, this communication is vague and cannot be perceived through the test data. However, it is worth mentioning that irregular test data are important on certain occasions. For example, the fact that the test data 'PASSWORD' and 'PASSWORD1' are irregular helps to understand the difference between the first and second scenarios.
Hence, typically, irregular test data should be avoided, unless there is a strong semantic motivation to distinguish one test data from another.
Acceptance tests in the FitNesse and Gherkin notations are widely adopted (Park and Maurer, 2008; Coutinho et al., 2019). Thus, in this study we propose specific metrics to be used for these notations. These metrics are based on the one proposed for the User Scenarios through User Interaction Diagrams notation (US-UID notation). We also evaluate the uniformity of the acceptance test data of several projects that use these notations and present a comparison of the uniformity between FitNesse and Gherkin.
The evaluated acceptance tests were collected from GitHub, a platform that hosts millions of open-source projects. Thirty-six projects were extracted from GitHub, 18 using the FitNesse notation and 18 using the Gherkin notation. For each project, we collected data uniformity measures as well as descriptive measures. Then, from the measures collected for each project, we compared the FitNesse and Gherkin notations in order to investigate whether there is a difference between the uniformity of these notations.
This article is organized as follows: Section two presents some related works. Sections three and four present the metrics proposed for the FitNesse and Gherkin notations. Sections five and six present two case studies showing the applicability of the proposed metrics. Section seven presents a comparison between these case studies. Section eight presents potential threats to the validity. Section nine presents the paper's discussions and conclusions.

Related Works
The two main related works are (Longo and Vilain, 2018; Longo et al., 2019). Longo and Vilain (2018) propose a kind of metric for measuring data uniformity in automated acceptance tests in the notation of User Scenarios through User Interaction Diagrams (US-UIDs). In Longo et al. (2019), the authors conduct an experiment with the treatment of data uniformity as the control factor. The conclusions were that, with the treatment of data uniformity, both the required volume of glue code and the time spent to automate the tests were reduced.
Some studies that are focused on acceptance tests investigate whether non-technical individuals could write executable specifications based on notations like FitNesse, US-UIDs and Gherkin (Melnik and Maurer, 2005; Alvestad, 2007; Longo and Vilain, 2015a; 2015b). Other studies focused exclusively on the FitNesse notation were conducted by Ricca et al. (2008; 2009). Most of these studies use qualitative measures, as expressed in IEEE Std 830 (1998), or quantitative measures such as time, for evaluation or comparison. Metrics that are more accurate for evaluating user stories were proposed by Lucassen et al. (2015; 2016) and applied by Lucassen et al. (2017). However, these kinds of metrics are specific to the user story format and have not been adapted for automated acceptance test notations.
Other studies, such as (Greiler et al., 2013a; 2013b), focus on problems in automated tests known as bad code smells. The solution to identifying bad code smells is usually the generation of a report with a set of specific measures. With the help of these reports, programmers can make balanced decisions and refactor test code in order to avoid bad smells.
These previous studies have focused on general metrics that evaluate and compare automated acceptance tests, as well as on identifying test problems. Yet, none of them has proposed metrics for data uniformity that are specific to the FitNesse and Gherkin notations. In other words, to the best of our knowledge, no other objective metrics to assess the uniformity of acceptance test data have been found, other than those of Longo and Vilain (2018) and Longo et al. (2019), especially for the FitNesse and Gherkin techniques. In addition, there is a lack of basic studies comparing the different notations of automated acceptance tests that consider a large volume of projects.

Metrics of Data Uniformity for FitNesse
As mentioned before, this paper proposes two kinds of metrics for measuring the uniformity of acceptance test data, for the FitNesse and Gherkin notations respectively. For the FitNesse notation, in general, the metrics are applied to a set of wiki pages that represent acceptance tests. A set of pairwise wiki pages is generated from the page set. The uniformity value for each pair of wiki pages is calculated by counting the number of uniform test data that are common to both wiki pages and the number of test data that are present in only one of them. Finally, general uniformity is the average of the uniformity of the pairs in the set.
This section presents the metrics for measuring the uniformity of tests in the FitNesse notation. The proposed metrics are based on earlier work in which a metric for calculating the uniformity of tests in the US-UID notation is proposed. We present the metric to calculate the data uniformity for FitNesse through a mathematical model. By applying this model, we can calculate the data uniformity of each pair of FitNesse features and the data uniformity of the entire project as well. Table 1 shows the related works.

Metric Input
Metric input is a set of FitNesse features. A FitNesse feature is a wiki page with tests. This set of features is used to measure data uniformity. Figure 4 presents an example of a feature and the indication of input and output elements of the test data that will be used as input for the proposed metrics. The set of features that will be the input of the metrics is represented by the following equation:

$$\omega = \{\tau_t \mid t = 1, \ldots, d\} \qquad (1)$$

Where:
$\omega$ = A set of features
$\tau_t$ = The t-th feature of set $\omega$, which must be denoted according to Eq. 2
$d$ = The number of features within set $\omega$

FitNesse Feature
Test data of the features are organized into tables, as seen in Fig. 4. The table caption (i.e., "should I buy milk") and the column headers (e.g., "cash in wallet" and "go to store?") consist of functional data. The proposed metric does not use functional data. The input and output data are shown in the body of the

Feature Pairs Generation
A set of pairwise combinations of features is generated from the input $\omega$. The pairs in this set are used later for calculating the metrics. The set of pairs is denoted as follows:

$$\psi = \{(\tau_t, \tau_q) \mid t = 1, \ldots, d;\ q = 1, \ldots, d;\ \tau_t \neq \tau_q\} \qquad (3)$$

Where:
$\psi$ = The set of feature pairs generated from $\omega$
$(\tau_t, \tau_q)$ = A pair of features generated from different features that belong to $\omega$
$\tau_q$ = The q-th feature that belongs to set $\omega$. The variable $q$ assumes the same values as $t$, that is, $q = 1, \ldots, d$
$\tau_t \neq \tau_q$ determines that the pair must consist of different t-th and q-th features.

For each pair $(\tau_t, \tau_q)$, auxiliary metrics of absolute uniformity are obtained by counting uniform and irregular data. These auxiliary measurements are then used to create the metric for relative uniformity, which is applied to each pair of features. The goal of the relative uniformity metric is to obtain a uniformity value that can be applied in the comparison between different pairs of features. The pairs of features are used in the metrics of both relative and absolute uniformity, which is why they are defined before the metrics.
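The pair generation step can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' Java code: the function name `feature_pairs` is hypothetical, features are stood in for by plain strings, and the pairs are treated as ordered (both (t, q) and (q, t) are kept), which is one reading of the rule that both indices range over all features and must differ.

```python
from itertools import permutations


def feature_pairs(features):
    """Return every ordered pair of features taken from different positions."""
    return list(permutations(features, 2))


# Hypothetical feature names used only to show the shape of the result.
pairs = feature_pairs(["login", "checkout", "search"])
for first, second in pairs:
    print(first, "->", second)
print(len(pairs), "pairs")  # d * (d - 1) pairs for d features
```

With ordered pairs, a set of d features yields d(d - 1) pairs; if unordered pairs were intended instead, `itertools.combinations` would give half as many.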

Auxiliary Metrics of Absolute Uniformities
The metrics of absolute uniformity are divided into input and output data. The metric for absolute uniformity of input data is the number of input data of feature $\tau_t$ that also belong to $\tau_q$. The expression $i_{tj} \in \tau_q$ means that input data $i_{tj}$, which belongs to $\tau_t$, also belongs to $\tau_q$.
The metric for absolute uniformity of output data is the number of output data of $\tau_t$ that also belong to $\tau_q$. The expression $o_{tl} \in \tau_q$ means that output data $o_{tl}$, which belongs to $\tau_t$, also belongs to $\tau_q$.
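One possible way to write these two counts formally is sketched below. The metric names $a_i$ and $a_o$, the bracket notation (equal to 1 when the enclosed condition holds and 0 otherwise), and the use of $n$ and $m$ for the numbers of input and output data of the first feature $\tau_t$ of the pair $(\tau_t, \tau_q)$ are assumptions made for illustration; the symbols of the original equations may differ.

```latex
a_i(\tau_t, \tau_q) = \sum_{j=1}^{n} \left[\, i_{tj} \in \tau_q \,\right],
\qquad
a_o(\tau_t, \tau_q) = \sum_{l=1}^{m} \left[\, o_{tl} \in \tau_q \,\right]
```

For instance, if $\tau_t$ has three input data and exactly one of them also appears in $\tau_q$, then $a_i(\tau_t, \tau_q) = 1$.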

Relative Uniformity Metric
The relative uniformity metric is defined from the auxiliary metrics of absolute uniformity. Its goal is to assign a numerical value to the uniformity of the data. The metric is applied to a pair of features, and the result is the ratio between the sum of the absolute uniformity metrics and the amount of input and output data for a pair of features. The relative uniformity metric always assumes values in the [0, 1] interval, regardless of the number of inputs and outputs contained in the features, and for this reason it is called relative uniformity. Value 1 represents the maximum uniformity and value 0 represents the maximum irregularity. If (n + m) = 0, that is, if there is no test data in the feature, a uniformity value of 1 is adopted. The main goal of relative uniformity is to create a normalized scale that enables the comparison between distinct pairs of features.
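A hedged formalization of this ratio is given below. It assumes that $a_i$ and $a_o$ denote the numbers of input and output data of the first feature $\tau_t$ that also appear in the second feature $\tau_q$, and that $n$ and $m$ are the numbers of input and output data of $\tau_t$. The name $u$ and this reading of $n + m$ are assumptions, since the text does not state which feature of the pair the totals refer to.

```latex
u(\tau_t, \tau_q) =
\begin{cases}
\dfrac{a_i(\tau_t, \tau_q) + a_o(\tau_t, \tau_q)}{n + m} & \text{if } n + m > 0 \\[2ex]
1 & \text{if } n + m = 0
\end{cases}
```

As a worked example under these assumptions, a feature with three input data and one output datum, of which one input and the output also occur in the other feature, gives $u = (1 + 1) / (3 + 1) = 0.5$.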
The relative uniformity metric is calculated for each pair of features, so the arithmetic mean of these values can be adopted as the descriptive measure of uniformity for all pairs of a project. Thus, from the relative uniformity metric for feature pairs, we propose the relative uniformity metric for the entire project: the sum of the uniformity metric values of each pair of features divided by the total number of pairs (Eq. 7).
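The whole FitNesse calculation can be illustrated with a short Python sketch. It is not the authors' Java implementation and rests on several assumptions: each feature is represented as a dictionary with hypothetical `"inputs"` and `"outputs"` lists, inputs are compared only with inputs and outputs only with outputs, pairs are ordered, the denominator is the amount of data in the first feature of the pair, and a project with fewer than two features is given a uniformity of 1.

```python
from itertools import permutations


def fitnesse_pair_uniformity(feature, other):
    """Relative uniformity of `feature` with respect to `other` (value in [0, 1])."""
    total = len(feature["inputs"]) + len(feature["outputs"])
    if total == 0:
        return 1.0  # no test data in the feature: a uniformity of 1 is adopted
    other_inputs = set(other["inputs"])
    other_outputs = set(other["outputs"])
    # Absolute uniformities: data of `feature` that also appear in `other`.
    shared_inputs = sum(1 for value in feature["inputs"] if value in other_inputs)
    shared_outputs = sum(1 for value in feature["outputs"] if value in other_outputs)
    return (shared_inputs + shared_outputs) / total


def fitnesse_project_uniformity(features):
    """Arithmetic mean of the pair uniformities over all ordered pairs."""
    pairs = list(permutations(features, 2))
    if not pairs:
        return 1.0  # assumption: nothing to compare with fewer than two features
    return sum(fitnesse_pair_uniformity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical test data for two small features.
features = [
    {"inputs": ["10", "yes"], "outputs": ["go"]},
    {"inputs": ["10", "no"], "outputs": ["stay"]},
]
print(fitnesse_project_uniformity(features))
```

In this example each feature shares one of its three data items with the other, so both pair values are 1/3 and the project value is also 1/3.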

Metric Implementation
The implementation of the metrics is built on the FitNesse tool, making it possible to extract the uniformity measures by running FitNesse. The entire code for calculating data uniformity was written in the Java programming language. In the source code of the FitNesse tool, the class fitnesse.testsystems.TestSystemListener allows the interception and monitoring of the execution of tests. Thus, this class was used to obtain the input and output data of the tests. After the test data are obtained, the proposed metrics are applied and the uniformity measures of each FitNesse project are extracted. In order to obtain the test input and output data, it is necessary to run the tests using FitNesse, which may incur some computational cost for processing the tests.

Metrics of Data Uniformity for Gherkin
Data uniformity metrics for the Gherkin notation are similar to the uniformity metrics for the FitNesse notation, except that there is no classification of input data and output data in the Gherkin notation. Input and output data in the Gherkin notation are only classified by the developer according to the meaning of the test information. Thus, Gherkin itself does not distinguish one type of test data from another, i.e., for Gherkin, everything is just test data. Figure 5 shows an example of a feature in Gherkin notation with the test data highlighted. Test data are encapsulated within the descriptions and are identified by being in quotes, by their formatting or by the developer's understanding upon reading the test. An example of test data, in Fig. 5, is the expression "Expensive Therapy". This test data appears along with the text following the keyword When and is enclosed in quotation marks, which identify it as test data.
We present the metric to calculate the data uniformity for Gherkin through a mathematical model. By applying this model, we can calculate the data uniformity of each pair of Gherkin features and the data uniformity of the entire project as well. The metrics can be applied to a set of features with acceptance tests. Then, a set of pairwise feature combinations is generated from the set of features. The uniformity of each feature pair is calculated by counting the uniform and irregular data points and applying them to an equation. The equation is the ratio between uniform test data and the total amount of test data. The total uniformity for a pair of features is the arithmetic mean of the uniformity of each feature from the pair.
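To illustrate only the simplest of these identification rules, the quoted values, the following Python sketch pulls double-quoted expressions out of a step. It is not how the authors collect test data, and it does not cover data identified by formatting or by the developer's reading. The function name `quoted_test_data` and the step wording are hypothetical; only the keyword When and the value "Expensive Therapy" come from the text.

```python
import re

# Matches any run of characters enclosed in double quotes.
QUOTED = re.compile(r'"([^"]*)"')


def quoted_test_data(step_text):
    """Return the values enclosed in double quotes in one Gherkin step."""
    return QUOTED.findall(step_text)


# Hypothetical step wording around the value taken from Fig. 5.
step = 'When I schedule the treatment "Expensive Therapy"'
print(quoted_test_data(step))
```

A step without quoted values simply yields an empty list, so such a rule would miss unquoted data such as numbers written inline.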

Metric Input
A metric input is any set of features. Data uniformity metrics are extracted from these features. The notation for the set of input features is:

$$\omega = \{\tau_t \mid t = 1, \ldots, d\} \qquad (8)$$

Where:
$\omega$ = Any set of features in the Gherkin notation
$\tau_t$ = The t-th feature in set $\omega$, which must be denoted according to Eq. 9
$d$ = The number of features in $\omega$

Gherkin Feature
A Gherkin feature is a test document and is denoted as the collection of its test data, where the j-th test data of the t-th feature is indexed by j = 1, ..., n, and n is the number of test data in $\tau_t$.

Feature Pairs Generation
A set of pairwise combinations of features is generated from the $\omega$ input. The pairs in this set are used later for the metrics calculation. The set of feature pairs is denoted as follows:

$$\varphi = \{(\tau_t, \tau_q) \mid t = 1, \ldots, d;\ q = 1, \ldots, d;\ \tau_t \neq \tau_q\} \qquad (10)$$

Where:
$\varphi$ = The set of feature pairs generated from $\omega$
$(\tau_t, \tau_q)$ = A pair of features generated from different features that belong to set $\omega$
$\tau_q$ = The q-th feature that belongs to set $\omega$. The variable $q$ assumes the same values as $t$, that is, $q = 1, \ldots, d$
$\tau_t \neq \tau_q$ = A restriction rule, that is, a pair must be formed by distinct features.

For each pair $(\tau_t, \tau_q)$ the auxiliary metric, called metric of absolute uniformity, is obtained. Then, using the auxiliary metric, the metric for relative uniformity is formulated. The objective of the relative uniformity metric is to obtain a uniformity value that can be applied to compare different pairs of features.

Absolute Uniformity Metric
The absolute uniformity metric is the number of test data of feature $\tau_t$ that also belong to $\tau_q$; that is, the j-th test data of $\tau_t$ is counted whenever it also appears in $\tau_q$.

Relative Uniformity Metric
The relative uniformity metric for Gherkin features is derived from the absolute uniformity metric. The objective of the relative uniformity metric is to summarize the uniformity of the data in a numerical value. The metric corresponds to the ratio between the absolute uniformity metric and the total amount of test data. The relative uniformity metric always assumes values in the [0, 1] interval, regardless of the amount of data contained in the test documents, and for this reason it is called relative uniformity. A value of 1 represents the maximum uniformity and a value of 0 represents the maximum irregularity. If n = 0, that is, if there is no test data in the feature, a uniformity value of 1 is adopted.
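The absolute and relative metrics for a pair of Gherkin features $(\tau_t, \tau_q)$ can be sketched formally as follows. The symbol $x_{tj}$ for the j-th test data of $\tau_t$, the metric names $a$ and $u$, and the bracket notation (1 when the condition holds, 0 otherwise) are assumptions introduced here for illustration; $n$ is the number of test data in $\tau_t$.

```latex
a(\tau_t, \tau_q) = \sum_{j=1}^{n} \left[\, x_{tj} \in \tau_q \,\right],
\qquad
u(\tau_t, \tau_q) =
\begin{cases}
\dfrac{a(\tau_t, \tau_q)}{n} & \text{if } n > 0 \\[2ex]
1 & \text{if } n = 0
\end{cases}
```

For example, if $\tau_t$ contains three test data and one of them also appears in $\tau_q$, then $a = 1$ and $u = 1/3$.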
The relative uniformity metric is calculated for each pair of features, so the arithmetic mean of all values can be adopted for measuring uniformity in a project. Thus, the relative uniformity metric for a project is the sum of the uniformity metric values of each pair of features divided by the total number of pairs (Eq. 13).
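A compact Python sketch of the Gherkin calculation follows. It is an illustration under assumptions rather than the authors' Java implementation: a feature is represented as a plain list of test data values, pairs are ordered, the denominator is the amount of data in the first feature of the pair, and a project with fewer than two features is given a uniformity of 1. The function names and the sample values are hypothetical.

```python
from itertools import permutations


def gherkin_pair_uniformity(feature, other):
    """Share of the test data in `feature` that also appear in `other`."""
    if not feature:
        return 1.0  # no test data in the feature: a uniformity of 1 is adopted
    other_values = set(other)
    shared = sum(1 for value in feature if value in other_values)
    return shared / len(feature)


def gherkin_project_uniformity(features):
    """Arithmetic mean of the pair uniformities over all ordered pairs."""
    pairs = list(permutations(features, 2))
    if not pairs:
        return 1.0  # assumption: nothing to compare with fewer than two features
    return sum(gherkin_pair_uniformity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical test data for two features.
login = ["SOFTWARETETINGHELP.COM", "Mary", "PASSWORD"]
logout = ["SOFTWARETETINGHELP.COM", "John"]
print(gherkin_project_uniformity([login, logout]))
```

Here the first feature shares 1 of 3 values and the second shares 1 of 2, so the project value is (1/3 + 1/2) / 2 = 5/12. Averaging over ordered pairs gives the same result as first averaging the two directions of each pair.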

Metric Implementation
The metric is implemented in the Cucumber tool (see footnote 2), so the uniformity metrics can be computed automatically. The entire code for calculating data uniformity was written in the Java programming language. The data for each test document are extracted with the help of the Cucumber implementation. Cucumber processes the documents and sets up the tests. Then, a listener that collects the test data is implemented in the cucumber.runner.TesteCase class. The metrics are applied after the test data are collected by the listener. It is necessary to run the tests in order to collect the test data.

Case Study I
The first case study investigates the uniformity of data from FitNesse projects in the GitHub repository through the application of the first proposed metrics. Figure 6 shows the general process of searching for FitNesse projects in the GitHub repository. The value inside each rectangle corresponds to the number of repositories found in each step. The process consists of four activities presented in the following subsections.

Searching FitNesse Projects on GitHub
The search for FitNesse projects was carried out on the GitHub platform. GitHub was used because it houses a wide variety of open-source projects. GitHub provides a word search function. The search was carried out with the word "FitNesse" and a filter (size > 1KB). The search returned 577 projects, of which 274 were developed in Java, 57 in JavaScript and 39 in C#. In total, the projects add up to 12K commits. The search and data collection on GitHub took place in February 2019.
Footnote 2: https://github.com/douglashiura/cucumber-data-uniformity.git

Filtering FitNesse Projects with Relevant Tests
The 577 repositories found in the previous activity were manually filtered according to four steps.

Application of the Proposed Metrics and Data Analysis
The application of the proposed metrics was performed using the FitNesse framework and the implementation of the proposed metrics. The projects were executed and the uniformity metrics, number of features and number of test data per feature were collected. The collected data were presented in graphic charts. The charts and analyses are presented in the next subsection.

Results of the FitNesse Projects

Descriptive Measures of the FitNesse Projects
Descriptive measures consist of general information about the projects. Figure 8 shows the number of test data for each project. Project P1 is the smallest project and consists of only 14 test data (input and output data). Project P18 is the largest project and consists of 12707 test data. Projects are ranked by the number of test data, from the smallest to the largest amount. Figure 9 presents the descriptive metrics of FitNesse projects. Descriptive metrics are defined as the average of the test data per feature (input and output data) and the total number of features for each project. Regarding the average test data per feature, project P1 has an average of 2.8 test data per feature and is composed of only five features, thus being the smallest project. Project P18 has an average of 186.87 test data per feature and is composed of 68 features, so, it has the highest average of test data. Project P16 consists of 160 features, which is the largest one in number of features.
There is a wide variety in the number of features per project, as the projects differ a lot in terms of domain, number of people involved and total commits. For example, project P18 has 4 contributors and 469 commits, while project P1 has only one contributor and 6 commits. Thus, the average number of test data per feature may also vary significantly. As an example, project P16 has many features (160) and a small average of test data per feature (9.01), when compared to P18, which has fewer features (68) and a high average of test data per feature (186.87). This means that the size (number of words written in the document) of the features in project P18 is larger than in project P16.
In addition, eight projects (44% of the total) have between 10 and 41 test data per feature. Project P18 has an outlying average number of test data per feature, which indicates that having this amount of data in features is not common. Across projects, the mean is 34 test data per feature and the median is 22. Regarding the total number of features for each project, nine projects (50% of the total) have between 9 and 36 features. Across projects, the mean is 31 features and the median is 16. Projects P11, P16 and P18 are outliers in this respect, that is, they have many more features than the other projects.

Data Uniformity on FitNesse Projects
Average data uniformity was measured by the metric proposed in Eq. 7. Figure 10 shows the average data uniformity for each project. Project P2 has the lowest uniformity rate (0.03) and project P10 the highest (0.73). Only four projects (22% of the total) show data uniformity rates above 0.5. Five projects (28%) have data uniformity rates below 0.1, that is, extremely low compared to the minimum recommended value of 0.45. The average uniformity rate is 0.31 and the median uniformity rate is 0.27.

Informal Assessment
In order to point out evidence that the use of the proposed metric helps to measure the uniformity of pairs of features with irregular or uniform test data in FitNesse projects, an informal assessment was carried out. The evaluation considers some pairs of features from three projects (P5, P6 and P18). The three projects were selected at random. Table 3 shows the result of this informal assessment. The objective was to find out if the measured value of the feature pair, using the proposed metric, is related to an informal assessment that intends to classify the feature pairs as irregular or uniform.

Case Study II
Case study II investigates the uniformity of data from Gherkin notation projects in the GitHub repository. In this case study, the second proposed metrics are applied. Figure 7 shows the general process of searching for Gherkin projects in GitHub. The process consists of four activities presented in the following subsections.

Searching Gherkin Projects on GitHub
The search for Gherkin projects was carried out on GitHub. GitHub was selected because it houses a wide variety of projects, many of them with public access. The platform provides an advanced search function with a filter that is specific to the Gherkin language, which was combined with a size filter (size > 1KB) so that the search returned only Gherkin language repositories. The search found 959 projects, ranked by "Best match". Search and data collection on GitHub were carried out in June 2019.

Filtering Gherkin Projects with Relevant Tests
The activity of selecting Gherkin projects with tests was limited to selecting only the first 18 projects from the 959 repositories found in the previous activity, ranked by "Best match". The limit of 18 projects was applied in order to reduce the research effort requirements and to have the same number of projects that were found for the FitNesse notation. Thus, from the projects ranked by "Best match", only the first 18 repositories with more than four features were selected. In addition, it was required that the project ran properly with the Cucumber framework. Table 4 presents the 18 selected projects.

Application of the Proposed Metrics and Data Analysis
The application of the proposed metrics was carried out with the Cucumber framework and the implementation of the metrics. The projects were executed and uniformity metrics, number of total features and total test data information were collected. The collected data were presented in charts. These results, analyses and charts are presented in the next subsection.

Results of the Gherkin Projects

Descriptive Measures of the Gherkin Projects
Descriptive metrics consist of general information about the measured projects. Figure 11 shows the number of test data for each project. Project C1 is the smallest project and consists of only 10 test data. Project C18 is the largest project and consists of 6,085 test data. The projects are ranked by the amount of test data. Figure 12 shows the descriptive metrics of the projects. These are the average test data per feature and the total number of features for each project. Project C1 has an average of two test data per feature and consists of only five features, making it the smallest project. Project C18 has an average of 132.28 test data per feature and is composed of 46 features, giving it the highest average of test data. Project C16 is composed of 109 features, which makes it the largest one in number of features.

The wide diversity in the number of features per project is likely to be caused by the domain of each project, the number of people involved and the total commits. As an example, project C18 has 18 contributors and 1,642 commits, while project C1 has only one contributor and 129 commits. The average number of test data per feature varies significantly as well. For instance, project C16 combines many features (109) with a small average of test data per feature (6.12). In project C18, however, this relationship is reversed, as it has an average of 132.28 test data per feature and just 46 features. This means that, in project C16, the features are smaller (few words in the document), while in project C18 the features are bigger (many words in the document).

Data Uniformity on Gherkin Projects
The average data uniformity was measured by the metric proposed in Eq. 13. Figure 13 shows the average data uniformity for each project. Projects C3, C11 and C12 have the smallest uniformity (0.02) and project C1 has the highest uniformity (0.64). Six (33%) projects (C3, C4, C11, C12, C14 and C17) present data uniformity levels below 0.01, that is, data uniformity in these projects is very low and was probably neglected during the specification of the tests. Only two projects (C1 and C2) have a uniformity score greater than 0.5.

Informal Assessment
In order to gather evidence that the metric helps to distinguish irregular from uniform feature pairs, an informal assessment was carried out. The evaluation was done on some pairs of features of three Gherkin projects (C5, C6 and C18). Table 5 presents the informal assessment of the uniformity of some pairs of features. The objective was to find out whether the measured value of a feature pair is related to an informal assessment that classifies the pair as irregular or uniform.

Table 5: Informal assessment of the uniformity of some pairs of features

| Project | Project uniformity | Pair uniformity | Feature pair | Informal assessment |
|---|---|---|---|---|
| C5 | | | | One feature is huge in relation to the other and therefore uniformity is low. |
| | | 0.57 | C 3 and A | The first feature is small and with similar or uniform data. |
| | | 0.40 | B and D 4 | One feature is small and another is large and some data is uniform. It presents a cloudy area between uniformity and irregularity. |
| C6 | 0.45 | 0.40 | E 5 and F 6 | There is uniform data, but there is also irregular data. However, it could be more uniform if the test was less vague or avoided using expressions like "When I enter incorrect information" and, as an alternative, defined what the test data is for the expression "incorrect information". |
| | | 0.00 | E and G 7 | Feature G is incomplete and without data. |
| | | 0.01 | H 8 and E | There is some uniform data, however in both tests many scenarios are elaborated and the scenarios attempt to cover the Registration and Authentication features. The coverage is in the sense of testing a wide range of data possibilities. These coverage scenarios are different from the acceptance testing proposal. |
| C18 | 0.15 | 0.12 | I 9 and J 10 | Both features are large and difficult to evaluate. Apparently, the uniformity is low. |
| | | 0.24 | L 11 and M 12 | There is a lot of test data in the features. There is uniform data, but visually it is difficult to say that the features are uniform. |
| | | 0.00 | N 13 and O 14 | The features do not have a very large size, which facilitates an evaluation. The data is very irregular, however, similar data such as "Facility" and "Facility0" are used, which could be made uniform. |

Fig. 13: Test data uniformity average for each project

The informal assessment was carried out with some difficulty, especially because the application domains of the tests were unfamiliar and, in some cases, because of the size of the features. When the size of the features increases, human evaluation can be impaired by the limited mental capacity to memorize data. However, the value 0 represented the maximum irregularity and was easily perceived. The highest uniformity value assessed informally was a pair with 0.57. The value 0.40 associated with the large size of a feature indicated a nebulous zone.

FitNesse Vs. Gherkin
The application of the uniformity metrics has yielded quantitative data from both case studies. Quantitative data, such as test data uniformity, can therefore be compared in order to determine whether there are differences between the notations. The descriptive metrics of the two case studies are compared to identify similarities between the projects. Additionally, for a deeper investigation, the uniformity of the two notations is compared to determine whether there is any relationship between descriptive metrics and uniformity. With the result of the investigation, we look to answer the following questions:
RQ1: Is there any difference between the numbers of features in the samples of the two acceptance test notations?
RQ2: Is there a difference in the average of test data per feature between the notations?
RQ3: Is there a difference in data uniformity between the two notations?
RQ4: Is there a correlation between number of features, average of test data per feature and data uniformity? That is, does the size (features and test data) of the project influence data uniformity?
These research questions are linked to the case studies and are intended to deepen the study by comparing the two notations. In this way, better-founded decisions can be made regarding the uniformity of the two notations. Boxplot diagrams and a hypothesis test are used to analyze the differences between notations in questions RQ1, RQ2 and RQ3. The boxplot diagrams present the data for a visual assessment, and the hypothesis test determines whether there are any differences between the samples. Question RQ4 is analyzed using scatter diagrams of the variables number of features, average number of test data per feature and data uniformity, as well as a regression model.

Answering RQ1
In order to answer RQ1, the number of features per project in case studies I and II (FitNesse and Gherkin, respectively) is used. Figure 14 shows a visual comparison of the numbers of features of the projects in the two notations. The number of features per project is considered (Fig. 9 and 12), with 18 FitNesse projects and 18 Gherkin projects. FitNesse projects have a median of 16 features per project and an average of 31.16 features per project, with a standard deviation of 39.19 features. Gherkin projects have a median of 23 features per project and an average of 29.27 features per project, with a standard deviation of 28.07 features. The distribution of the number of features per project was analyzed via a t-test at a significance level of α = 0.05, and no statistically significant difference between the number of features of the two notations was found (p-value = 0.8691). Thus, regarding the number of features, it was not possible to observe statistically significant differences between the projects of each notation.
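The test above can be approximately rechecked from the reported summary statistics alone. The sketch below assumes a pooled two-sample Student t-test with 18 + 18 - 2 = 34 degrees of freedom; the paper does not state which t-test variant was used, so this is an assumption. The function names are illustrative, and the p-value is obtained by integrating the t density numerically with Simpson's rule using only the standard library.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t: float, df: int, steps: int = 2000) -> float:
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1.0 - 2.0 * area)


def pooled_t_test(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance t-test computed from summary statistics (assumed variant)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / std_err
    return t, df, two_sided_p(t, df)


# Number of features per project: FitNesse vs. Gherkin, as reported for RQ1.
t_stat, dof, p_value = pooled_t_test(31.16, 39.19, 18, 29.27, 28.07, 18)
print(f"t = {t_stat:.3f}, df = {dof}, p = {p_value:.4f}")
```

Under these assumptions the p-value should land close to the reported 0.8691, far above α = 0.05, which is consistent with the conclusion that the numbers of features do not differ significantly. Small deviations are expected because the inputs are rounded summary statistics.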

Answering RQ2
In order to answer RQ2, the average number of test data per feature of each project in the FitNesse and Gherkin case studies is used. Figure 15 shows a visual comparison of the average number of test data per feature for each notation.
The average number of test data per feature is considered per project (Fig. 9 and 12), with 18 FitNesse projects and 18 Gherkin projects. The FitNesse notation has a median of 22.21 test data per feature, an average of 34.18 and a standard deviation of 42.54. The Gherkin notation has a median of 12.63 test data per feature, an average of 24.41 and a standard deviation of 31.89. The distribution of the average number of test data per feature was analyzed via a t-test at a significance level of α = 0.05, with the result that there is no statistically significant difference in the average number of test data per feature between the two notations (p-value = 0.4417). Thus, regarding the average number of test data per feature, it was not possible to observe statistically significant differences between the projects of each notation. The conclusion reached is that the amount of test data in each feature is similar for both notations.

Answering RQ3
Question RQ3 can be interpreted in two ways. The first is whether there are differences in data uniformity between the projects of each acceptance test notation. The second is whether there are differences in data uniformity between the pairs of features of the FitNesse and Gherkin notations, regardless of the projects.
In order to answer the first interpretation of the question, that is, whether there is a difference in uniformity between FitNesse and Gherkin projects, the average uniformity metrics for each project are used (Fig. 10 and 13). Figure 16 shows a visual comparison of the average uniformity of the projects of the two notations, considering 18 FitNesse projects and 18 Gherkin projects. The FitNesse notation has a median of data uniformity of 0.26, an average of 0.31 and a standard deviation of 0.23. The Gherkin notation has a median of data uniformity of 0.18, an average of 0.24 and a standard deviation of 0.19. The uniformity of the 36 projects analyzed is not normally distributed, so a square-root transformation was applied to the data. After the transformation, the data were normally distributed and the variances were equal. Thus, the distribution of data uniformity per project was analyzed via a t-test at a significance level of α = 0.05, with the result that there is no statistically significant difference in data uniformity per project between the two notations (p-value = 0.2961). So, regarding data uniformity per project, it was not possible to observe statistically significant differences between notations. The conclusion reached is that test data uniformity of the projects is similar in both notations.
Longo and Vilain (2018) advocate a uniformity value of 0.45 as the minimum to be reached before the automation of the tests. As Fig. 16 displays, most projects tend to be irregular rather than uniform when using this minimum value. Only 8 (22%) of the 36 projects reached the minimum measure of 0.45 data uniformity.
In order to answer the second interpretation of the question, that is, whether there is a difference in uniformity between the pairs of features of FitNesse and Gherkin, the uniformity metrics of all feature pairs for all projects are used. Figure 17 shows a visual comparison of the uniformity distribution of the feature pairs of each project, split by notation. For the FitNesse notation, 43,040 pairs of features extracted from 18 projects are considered. For the Gherkin notation, 28,306 pairs of features are considered.
Analyzing the pairs of features, the FitNesse notation has a median of uniformity of 0.00, an average of 0.16 and a standard deviation of 0.27. The 0.00 median is explained by the fact that most of the FitNesse feature pairs are completely irregular with respect to each other. The Gherkin notation has a median of uniformity of 0.08, an average of 0.26 and a standard deviation of 0.34. Thus, in a preliminary analysis, the average uniformity of both notations is lower than 0.5. According to the scale, one is the most uniform value and zero is the most irregular value. Therefore, it can be stated that the projects from both notations are more irregular than uniform.
The distribution of data uniformity of the feature pairs of each notation was analyzed via a Z-test at a significance level of α = 0.05, resulting in a statistically significant difference in the uniformity of the features between the two notations (p-value = 0.0008967). Thus, regarding data uniformity by pairs of features, it was possible to observe statistically significant differences between notations. Since the uniformity values are lower than 0.5 and the FitNesse average is the lower of the two, the conclusion reached is that the features are more irregular in the FitNesse notation than in the Gherkin notation.
To summarize the answer to question RQ3, the conclusion reached is that the data uniformity of the projects is similar between FitNesse and Gherkin notations. When the analysis is expanded to features, it can be said that the data of the FitNesse features is more irregular than the data of the Gherkin features.

Answering RQ4
In order to answer question RQ4, the variables number of features, average number of test data per feature and data uniformity of each project are used. Figure 18 presents the scatter diagrams for the variables of the 36 projects. From a visual perspective, the scatter diagrams do not show correlation between the variables. Linear regression was used for the analysis, and outlying data were excluded at first. These outlying data are from projects P11, P16, P17, P18, C16 and C18. An attempt was then made to build a suitable model. However, neither of the independent variables, number of features and average number of test data per feature, has a significant correlation with the response variable (data uniformity). At a significance level of α = 0.05, with p-value = 0.1361, first-degree linear regression is not adequate, namely, there is no model to characterize the data. Thus, it is concluded that the variables number of features and average number of test data per feature have no correlation with data uniformity. That is, the size of the project does not influence the uniformity of the data.
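As a simplified illustration of the kind of check behind RQ4, the sketch below computes a pairwise Pearson correlation between a size variable and data uniformity. This is a stand-in for, not a reproduction of, the regression model used in the study, and the per-project values are hypothetical placeholders because the raw measurements are not listed in this section.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-project values, for illustration only.
number_of_features = [5, 12, 23, 31, 46]
data_uniformity = [0.64, 0.10, 0.30, 0.15, 0.20]

r_features = pearson_r(number_of_features, data_uniformity)
print(f"r(features, uniformity) = {r_features:.2f}")
```

A coefficient near zero on the real measurements would agree with the visual impression of the scatter diagrams; testing its significance would additionally require the sample size and a suitable test.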

Threats to Validity
The case studies are incomplete without discussing the concerns that can threaten the validity of the results. Internal validity refers to causal inferences based on experimental data (Yin, 2003). As for the case studies, the scope of the work was limited to projects found exclusively as public GitHub repositories. The GitHub platform was chosen because it has more than 10 million repositories and is becoming one of the most important sources of software artifacts on the Internet (Kalliamvakou et al., 2014). The GitHub platform was used in order to mitigate any bias regarding size, human qualification and type of the repository. In this way, the GitHub platform provided projects from all around the world. However, only public projects were used, which may represent a difference in results when compared to private projects. In both case studies, only projects with more than five features were selected. This restriction was intended to avoid extremely small or premature projects. The sample of projects for case study I was composed of all executable projects to which the metrics could be applied. So, for case study I, accurate conclusions can be drawn regarding FitNesse projects on GitHub.
The sample of case study II is composed only of the 18 executable repositories to which the metrics could be applied, so it does not cover all GitHub repositories, but a sample of 18 out of 959. This sample was limited by the effort needed to apply the metrics, because the metrics were applied with the Cucumber framework and, depending on the project, several interventions were necessary for the correct execution. In addition, there is some bias in case study II, because the first 18 repositories ranked by "Best match" were selected. This bias occurred because, initially, the intention was to investigate all GitHub repositories, but the time and effort required to do so discouraged a complete investigation. Nevertheless, there is no difference between FitNesse and Gherkin in the number of features or in the average number of test data per feature of the projects. This similarity between FitNesse and Gherkin projects mitigates the bias in the selection of Gherkin projects and contributes to the validity of the conclusions.
Construct validity refers to the appropriate use of metrics and evaluation measurements (Yin, 2003). In the case studies, uniformity measures were obtained according to the proposed metrics. For the other statistical measures, standard statistical methods were used and the recommended practices for applying them were scrupulously followed.
External validity refers to the ability to generalize the findings to other domains (Yin, 2003). External validity is a threat to both case studies, because the scope is limited to public GitHub repositories. In private repositories, there may be longer and larger projects, and there may be better qualified teams involved in the task of treating the data to improve the quality of acceptance tests. However, the sample was made up of public repositories from around the world and for various purposes of applicability of acceptance tests. Reliability refers to the ability of other researchers to replicate a methodology (Massey et al., 2014). The metric proposal for the two notations, the evaluation technique of the case studies and its results were detailed. Still, we consider it important for other researchers to be able to reproduce the study, for which the implementation and all the collected data were made available on GitHub 3,4 .

Discussion
The applicability of the proposed metrics was feasible because their implementation was included in the FitNesse and Cucumber frameworks. The proposed metrics were applied to 18 FitNesse projects and 18 Gherkin projects. Uniformity measurements were made at two levels: at the level of pairs of features and at the project level. At the project level, the uniformity metric presents a more general view. However, the uniformity at the project level was low, that is, most of the analyzed projects, regardless of their rating, had less than 0.45 of data uniformity. Longo and Vilain (2018) classified projects with less than 0.45 uniformity as irregular. The reason why many projects have irregular data has not been identified; however, it is assumed that data uniformity has not been addressed in these projects.
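The threshold classification described above can be sketched as follows. The function names are illustrative; the 0.45 minimum is the value suggested by Longo and Vilain (2018), and treating a value of exactly 0.45 as reaching the minimum is an assumption based on the wording "less than 0.45".

```python
MINIMUM_UNIFORMITY = 0.45  # minimum suggested by Longo and Vilain (2018)


def classify(uniformity: float, minimum: float = MINIMUM_UNIFORMITY) -> str:
    """Label a project-level uniformity score as 'uniform' or 'irregular'."""
    if not 0.0 <= uniformity <= 1.0:
        raise ValueError("uniformity must lie in [0, 1]")
    return "uniform" if uniformity >= minimum else "irregular"


def share_reaching_minimum(uniformities, minimum: float = MINIMUM_UNIFORMITY) -> float:
    """Fraction of projects whose uniformity reaches the minimum."""
    values = list(uniformities)
    if not values:
        raise ValueError("at least one project is required")
    return sum(u >= minimum for u in values) / len(values)


# Project averages quoted in the text for C1 (highest) and C3 (lowest).
labels = {"C1": classify(0.64), "C3": classify(0.02)}

# Share reported in the study: 8 of the 36 projects reached the minimum.
reported_share = 8 / 36
print(labels, f"{reported_share:.0%}")
```

A check of this kind could be run before test automation starts, so that projects below the minimum are reviewed for uniformity first.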
Simple training can be effective for handling uniformity when acceptance tests are specified by a single person. However, when a team is responsible for the specification and development, complications in uniformity can arise. In a team specification, each individual may know different data examples and readily use them at the time of specification. Later, during test automation, when the tester implements the glue code, doubts may arise if the test data are irregular. Because testers are often not close to the test specifiers, they may proceed despite their doubts and produce glue code with bad smells, which can accumulate throughout the development of the tests. If these doubts are associated with the irregularity of the test data, the ideal solution is to review the uniformity before implementing the glue code. In this way, the metric can be useful both for an overall evaluation of the project and for an evaluation of the pairs of features.
Upon evaluating a project, we do not have a specific number to indicate whether a FitNesse or Cucumber project is uniform or not. Longo and Vilain (2018) suggest a minimum uniformity value of 0.45 for US-UID projects, and the impact of uniformity on the automation of tests has already been studied for that notation. The metrics can be applied to any type of project, but it is unclear which minimum uniformity value should be required during test specification and automation in order to obtain better communication and high-quality glue code.
In that sense, it is recommended that a project reach the minimum value of 0.45 of data uniformity before the test automation process starts. In addition, during the application of the metric in the case studies, it was possible to observe some quality criteria in the projects with more uniformity. One of the quality criteria was clarity, that is, most tests with greater uniformity can be read and understood. However, many other factors can influence clarity, such as knowledge of the problem and the language used to write the test. Not all evaluated projects had glue code available, which suggests low-quality work 5 . However, even in projects with glue code available, low quality was observed, that is, the glue code was written with several responsibilities, including loops (for) and conditional branches (if). As stated by Meszaros (2007), conditional test logic makes tests harder to understand and should be avoided whenever possible. As far as could be understood, the branches were only being used to enable the SUT configuration and were needed because of irregular test data.
5 https://github.com/Vardot/varbase-behat/blob/8.x-4.x/features/bootstrap/SelectorsContext.php

Conclusion
The comparison between FitNesse and Gherkin suggests that there are no differences in project sizes. There is also no difference in uniformity between projects. However, when comparing pairs of features, there is a significant difference. The pairs of FitNesse features have less uniformity than the pairs of Gherkin features. Above all, data uniformity values in the evaluated projects are low, that is, projects in both notations have high amounts of irregular test data.
The main contributions of this article are the two proposed metrics. These metrics can be applied to any project with Gherkin or FitNesse notation and can be a good quantitative assessment tool. A contribution that can be used in future research is the information collected from the 36 GitHub projects. The metric has potential for applicability throughout the specification of acceptance tests. Teams using the metrics can go through the measurements and make interventions to improve uniformity before the test automation. The other contributions of this work include the findings from the case studies. First, we found that there are no statistical differences between FitNesse and Gherkin notations regarding the uniformity of projects. We also found that the size of the project does not appear to have an impact on the uniformity. Above all, we observed that projects with greater uniformity tend to have better glue code with better reuse.
An investigation of the impact of uniformity on communication throughout development is suggested as future work, as well as experiments evaluating the effort and volume of the glue code for projects with different levels of uniformity. Another potential future work is to investigate the use of test data recommendation systems to assist the construction of tests by recommending uniform data. Also, the measurement of uniformity could be included in the editing tools of FitNesse and Cucumber tests as a guide to test specifiers. Currently, the proposed metrics are applied through an implementation that relies on executing the tests with the FitNesse and Cucumber testing frameworks.