HTML Sanity Checker Architecture Documentation

arc42

© This document uses material from the arc42 architecture template, freely available at http://github.com/arc42.

This material is open source and provided under the Creative Commons Sharealike 4.0 license. It comes without any guarantee. Use at your own risk. arc42 and its structure by Dr. Peter Hruschka and Dr. Gernot Starke. Asciidoc version initiated by Markus Schärtel and Jürgen Krey, completed and maintained by Ralf Müller and Gernot Starke.

Version 0.9 of 2014-10-25

Within the following text, the "Html Sanity Checker" is abbreviated as HtmlSC.

Goals of this Documentation

This documentation is an example of arc42 documentation.

You may copy this documentation or parts of it for your own projects. In such cases you must include a link or reference to arc42 or aim42 (we regard this as fair-use).

For a real-world project of this size, this ratio of documentation to code would be oversized.

Disclaimer

We provide absolutely no guarantee, neither for the accuracy of this documentation nor for any property or feature of the software described here.

Do not use this software in critical situations or projects.

1. Introduction and Goals

1.1. Requirements Overview

1.1.1. Basic Usage

  1. A user configures the location (directory and filename) of an HTML file,

  2. and the corresponding images directory.

  3. HtmlSC performs various checks on the HTML and

  4. reports its results either on the console or as HTML report.

HtmlSC can run from the command line or as Gradle-plugin.

1.1.2. Terminology: What Can Go Wrong in HTML Files?

Apart from purely syntactical errors, many things can go wrong in HTML, especially with respect to hyperlinks, anchors and ids, as those are often manually maintained.

Broken cross references: Cross references (internal links) can be broken, e.g. due to a missing or misspelled link target. See BrokenCrossReferencesChecker.

Missing local resources: Referenced local resources (other than images) can be missing or misspelled. See MissingLocalResourcesChecker.

Duplicate link targets: Link targets can occur several times with the same name, so the browser cannot know which is the desired target. See DuplicateLinkTargetsChecker.

Illegal links: The links (aka anchors or URIs) can contain illegal characters or violate HTML link syntax. See IllegalLinkChecker.

Broken external links: External links can be broken for many reasons: misspelled, link target currently offline, illegal link syntax. See BrokenExternalLinksChecker.

Missing alt attribute in image tags: Images missing an alt attribute.

Checking and reporting these errors and flaws is the central business requirement of HtmlSC.

Important terms (domain terms) of HTML sanity checking are documented in a (small) domain model.

1.1.3. General Functionality

Table 1. General Requirements
ID Functionality Description

G-1

read HTML file

HtmlSC shall read a single (configurable) HTML file

G-2

Gradle-plugin

HtmlSC can be run as Gradle-plugin.

G-3

command line usage

HtmlSC can be called from the command line with arguments and options

G-4

configurable output

output can be configured to console or file

G-5

free and open source

all required dependencies shall be compliant with the CC-SA-4 license.

G-6

available via public repositories, like bintray or jcenter.

G-7

1.1.4. Types of Sanity Checks

Table 2. Required Checks
ID Check Description

R-1

missing image files

Check for all image tags whether the referenced image files exist. See MissingImageFilesChecker

R-2

broken internal links

Check for all internal links from anchor tags (href="XYZ") whether the link targets "XYZ" are defined. See BrokenCrossReferencesChecker

R-3

missing local files

Check that referenced local files (other HTML files, PDFs or similar) exist. See MissingLocalResourcesChecker

R-4

duplicate link targets

Check for all bookmark definitions (… id="XYZ") whether the ids ("XYZ") are unique. See DuplicateLinkTargetsChecker

R-5

Malformed links

Check all links for syntactical correctness

R-6

missing alt-attribute

Check for image tags without an alt attribute. See MissingImgAltAttributeChecker

R-7

unused-images

check for files in image-directories that are not referenced by any of the HTML files in this run

R-8

illegal link targets

Checks for malformed or illegal anchors (link targets).

Table 3. Optional Checks
ID Check Description

Opt-1

missing external images

Check externally referenced images for availability

Opt-2

broken external links

Check external links for both syntax and availability

1.1.5. Reporting and Output Requirements

Table 4. Reporting Requirements
ID Requirement Description

R-1

various output formats

checking output in plaintext and HTML

R-2

output to stdout

HtmlSC can output results on stdout (the console)

R-3

configurable file output

HtmlSC can store results in files in configurable directories

1.2. Quality Goals

Table 5. Quality-Goals
Priority Quality-Goal Scenario

1

Correctness

Every broken internal link (cross reference) is found.

1

Correctness

Every missing local image is found.

2

Flexibility

multiple checking algorithms, report formats and clients. At least Gradle, command-line and a graphical client have to be supported.

2

Safety

Content of the files to be checked is never altered.

2

Correctness

Correctness of every checker is automatically tested for positive AND negative cases

2

Correctness

Every reporting format is tested: Reports must exactly reflect checking results.

3

Performance

Check of 100kB html file performed under 10 secs (excluding gradle startup)

1.3. Stakeholder

Table 6. Stakeholder
Role Description Goal, Intention

Documentation author

writes documentation with Html output

wants to check that the resulting document contains good links, image references

arc42 user

uses the arc42 template for architecture documentation

wants a small but practical example of how to apply arc42.

aim42 contributor

contributes to the aim42 method guide

wants to check generated HTML to ensure links and images are correct during the (Gradle-based) build process

software developer

wants an example of pragmatic architecture documentation and arc42 usage

2. Constraints

HtmlSC shall be:

  • platform-independent; it should run on the major operating systems (Windows™, Linux and Mac-OS™)

  • integrated with the Gradle build tool

  • runnable from the command line

  • developed under a liberal open-source license

3. Context

3.1. Business Context

Figure 1. Business Context
Table 7. Business Context
Neighbour Description

user

documents software with toolchain that generates html. Wants to ensure that links within this html are valid.

build system

local html files

HtmlSC reads and parses local html files and performs sanity checks within those.

local image files

HtmlSC checks if linked images exist as (local) files.

external web resources

Optionally, HtmlSC can be configured to check for the existence of external web resources. Due to the nature of web systems, this check might need a significant amount of time and might yield invalid results due to network and latency issues.

3.2. Deployment Context

The following diagram shows the participating computers ({node}) with their technical connections plus the major {artifact} of HtmlSC, the hsc-plugin-binary.

Figure 2. Deployment Context
Table 8. Deployment Context
Node / Artifact Description

{node} hsc-development

where development of HtmlSC takes place

{artifact} hsc-plugin-binary

Compiled and packaged version of HtmlSC including required dependencies.

{node} artifact repository (Bintray)

global public cloud repository for binary artifacts, similar to mavenCentral. HtmlSC binaries are uploaded to this server.

{node} hsc user computer

where arbitrary documentation is created, with HTML as the output format.

{artifact} build.gradle

Gradle build script configuring (among other things) the HtmlSC plugin to perform the Html checking.

For details, see the deployment view.

4. Solution Strategy

5. Building Block View

5.1. Whitebox HtmlSanityChecker

Figure 3. Whitebox (Overall system)
Rationale

We used functional decomposition to separate responsibilities:

  • CheckerCore shall encapsulate checking logic and Html parsing/processing.

  • all kinds of outputs (console, html-file, graphical) shall be handled in a separate component (Reporter)

  • Gradle-specific implementation details shall be encapsulated.

Contained Blackboxes
Table 9. HtmlSanityChecker building blocks

CheckerCore

core: html parsing and sanity checking, file handling

HSC Gradle Plugin

integrates the Gradle build tool with HtmlSC, enabling arbitrary gradle builds to use HtmlSC functionality.

HSC Command Line Interface

 (not documented)

HSC Graphical Interface

(planned, not implemented)

Reporter

outputs the collected checking results to configurable destinations, e.g. StdOut or a Html file.

Interfaces
Table 10. HtmlSanityChecker internal interfaces
Interface Description

usage via shell

An arc42 user uses a command-line shell to call HtmlSC

build system

 currently restricted to Gradle: The build system uses HtmlSC as configured in the buildscript.

local-html and local-images

HtmlSC needs access to several local files, especially the html page to be checked and to the corresponding image directories.

5.1.1. CheckerCore (Blackbox)

Intent/Responsibility

CheckerCore contains the core functions to perform the various sanity checks. It parses the HTML file into a DOM-like in-memory representation, which is then used to perform the actual checks.

Interfaces
Table 11. CheckerCore Interfaces
Interface (From-To) Description

Command Line Interface → Checker

Exposes the AllChecksRunner class, as described in AllChecksRunner.

Gradle Plugin → Checker

Exposes HtmlSC via a standard Gradle plugin, as described in the Gradle user guide.

Files
  • org.aim42.htmlsc.AllChecksRunner

  • org.aim42.htmlsc.HtmlSanityCheckGradlePlugin

5.2. Building Blocks - Level 2

5.2.1. CheckerCore (Whitebox)

Figure 4. CheckerCore (Whitebox)
Rationale

This structure follows a strictly functional decomposition:

  • parsing and handling html input

  • checking

  • collecting checking results

Contained Blackboxes
Table 12. CheckerCore building blocks

Checker

Abstract class, used in the form of the template method pattern. Shall be subclassed for all checking algorithms.

AllChecksRunner

Facade to the different Checker instances. Provides a (parameter-driven) command-line interface.

ResultsCollector

Collects all checking results. Its interface Results is contained in the whitebox description.

HtmlParser

Encapsulates html parsing, provides methods to search within the (parsed) html.

5.2.2. Checker and xyzChecker Subclasses

The abstract Checker provides a uniform interface (public void check()) to different checking algorithms. It is based upon the concept of extensible checking algorithms.

5.3. Building Blocks - Level 3

5.3.1. ResultsCollector (Whitebox)

Figure 5. Results Collector (Whitebox)
Rationale

This structure follows the hierarchy of checks, namely managing results for:

  1. a number of pages/documents, containing:

  2. a single page, each containing many

  3. single checks within a page

Contained Blackboxes
Table 13. ResultsCollector building blocks

Per-Run Results

results for potentially many Html pages/documents.

Single-Page-Results

results for a single page

Single-Check-Results

results for a single type of check (e.g. missing-images check)

Finding

a single finding (e.g. "image 'logo.png' missing"). Can hold suggestions and (planned for future releases) the responsible HTML element.

Interface Results

The Results interface is used by all clients (especially Reporter subclasses, graphical and command-line clients) to access checking results. It consists of three distinct APIs: overall run results (RunResults), single-page results (PageResults) and single-check results (CheckResults). See the interface definitions below, taken from the Groovy source code:

Interface RunResults
public interface RunResults {

    // returns results for all pages which have been checked
    public ArrayList<SinglePageResults> getResultsForAllPages()

    // how many pages were checked in this run?
    public int nrOfPagesChecked()

    // how many checks were performed in all?
    public int nrOfChecksPerformedOnAllPages()

    // how many findings (errors and issues) were found in all?
    public int nrOfFindingsOnAllPages()

    // how long took checking (in milliseconds)?
    public Long checkingTookHowManyMillis()
}
Interface PageResults
public interface PageResults {

    // what's the title of this page?
    public String getPageTitle()

    // what's the filename and path?
    public String getPageFileName()
    public String getPageFilePath()

    // how many items have been checked?
    public int nrOfItemsCheckedOnPage()

    // how many problems were found on this page?
    public int nrOfFindingsOnPage()

    // how many different checks have run on this page?
    public int howManyCheckersHaveRun()
}
Interface CheckResults
public interface CheckResults {

    // return a description of what is checked
    // (e.g. "Missing Images Checker" or "Broken Cross-References Checker")
    public String description()

    // returns all findings/problems found during this check
    public ArrayList<Finding> getFindings()
}
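
The three result levels described above can be illustrated with a small Java sketch. It is not the HtmlSC implementation: the class names ending in Sketch are hypothetical, a finding is reduced to its message text, and only the method names (nrOfPagesChecked, nrOfFindingsOnAllPages, nrOfFindingsOnPage, howManyCheckersHaveRun, description, getFindings) are borrowed from the interfaces above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for the result classes. Only the
// method names are taken from the interfaces above; a finding is
// reduced to its message text.
final class CheckResultsSketch {
    private final String description;
    private final List<String> findings;

    CheckResultsSketch(String description, List<String> findings) {
        this.description = description;
        this.findings = new ArrayList<>(findings);
    }

    String description() { return description; }

    List<String> getFindings() { return findings; }
}

final class PageResultsSketch {
    private final String pageFileName;
    private final List<CheckResultsSketch> checkResults;

    PageResultsSketch(String pageFileName, List<CheckResultsSketch> checkResults) {
        this.pageFileName = pageFileName;
        this.checkResults = new ArrayList<>(checkResults);
    }

    String getPageFileName() { return pageFileName; }

    int howManyCheckersHaveRun() { return checkResults.size(); }

    // sums the findings of every check that ran on this page
    int nrOfFindingsOnPage() {
        int total = 0;
        for (CheckResultsSketch check : checkResults) {
            total += check.getFindings().size();
        }
        return total;
    }
}

final class RunResultsSketch {
    private final List<PageResultsSketch> pages;

    RunResultsSketch(List<PageResultsSketch> pages) {
        this.pages = new ArrayList<>(pages);
    }

    int nrOfPagesChecked() { return pages.size(); }

    // sums the findings of every page in this run
    int nrOfFindingsOnAllPages() {
        int total = 0;
        for (PageResultsSketch page : pages) {
            total += page.nrOfFindingsOnPage();
        }
        return total;
    }
}
```

Each level only aggregates the level below it, so a client such as a reporter can start at the run and drill down to single findings without knowing which checkers produced them.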

6. Runtime View

Note: Not appropriate for this system due to very simple implementation.

7. Deployment View

Figure 6. Deployment
Table 14. Deployment
Node / Artifact Description

hsc plugin binary

compiled version of HtmlSC, including required dependencies.

hsc-development

where development of HtmlSC takes place

artifact repository (Bintray)

global public cloud repository for binary artifacts, similar to mavenCentral. HtmlSC binaries are uploaded to this server.

hsc user computer

where arbitrary documentation is created, with HTML as the output format.

build.gradle

Gradle build script configuring (among other things) the HtmlSC plugin to check some documentation.

The three nodes (computers) shown in Deployment are connected via the public internet.

The sanity checker will:

  1. be bundled as a single jar,

  2. be uploaded to the Bintray repository,

  3. be referenceable within a Gradle build file,

  4. provide a main method with parameters and options, so all checks can be called from the command line.

8. Technical and Crosscutting Concepts

8.1. HTML Checking Domain Model

Figure 7. HTML Checking Domain Model
Table 15. Domain Model
Term Description

Anchor

HTML element to create →Links. Contains link targets in the form <a href="link-target">

Cross reference

link from one part of the document to another part within the same document. A special form of →internal-link, with a →link-target in the same document.

external link

link to another page or resource at another domain.

Finding

Description of a problem found by one →Checker within the →Html Page.

Html Element

HTML pages (documents) are made up of HTML elements, e.g. <a href="link target">, <img src="image.png"> and others. See the W3 Consortium.

HTML Page

A single chunk of HTML, mostly regarded as a single file. Shall comply with standard HTML syntax. Minimal requirement: Our HTML parser can successfully parse this page. Contains →Html Elements. Also called html document.

id

Identifier for a specific part of a document, e.g. <h2 id="someHeader">. Often used to describe →link targets.

internal link

link to another section of the same page or to another page of the same domain. Also called local link.

Link

Any reference in the →html page that lets you display or activate another part of this document (internal link) or another document, image or resource (can be either →internal (local) or →external link). Every link leads from the link source to the link target.

Link Target

The target of any →link, e.g. a heading or any other part of an HTML document, or any internal or external resource (identified by URI). Expressed by →id.

local resource

local file, either other html files or other filetypes (e.g. pdf, docx)

Run Result

The overall results of checking a number of pages (at least one page).

Single Page Result

A collection of all checks of a single → HTML Page.

URI

Uniform Resource Identifier. Defined in RFC 2396 (since obsoleted by RFC 3986). The ultimate source of truth concerning link syntax and semantics.

8.2. Gradle Plugin Concept and Development

You should definitely read the original Gradle User Guide on custom plugin development.

To enable the required Gradle integration, we implement a lean wrapper as described in the Gradle user guide.

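  import org.gradle.api.Plugin
  import org.gradle.api.Project

  // registers the 'htmlSanityCheck' task in every project that applies the plugin
  class HtmlSanityCheckPlugin implements Plugin<Project> {
      void apply(Project project) {
          project.task('htmlSanityCheck',
                  type: HtmlSanityCheckTask,
                  group: 'Check')
      }
  }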

8.2.1. Directory Structure and Required Files

|-htmlSanityCheck
   |  |-src
   |  |  |-main
   |  |  |  |-org
   |  |  |  |  |-aim42
   |  |  |  |  |  |-htmlsanitycheck
   |  |  |  |  |  |  | ...
   |  |  |  |  |  |  |-HtmlSanityCheckPlugin.groovy (1)
   |  |  |  |  |  |  |-HtmlSanityCheckTask.groovy
   |  |  |  |-resources
   |  |  |  |  |-META-INF                          (2)
   |  |  |  |  |  |-gradle-plugins
   |  |  |  |  |  |  |-htmlSanityCheck.properties  (3)
   |  |  |-test
   |  |  |  |-org
   |  |  |  |  |-aim42
   |  |  |  |  |  |-htmlsanitycheck
   |  |  |  |  |  |  | ...
   |  |  |  |  |  |  |-HtmlSanityCheckPluginTest
   |
1 the actual plugin code: HtmlSanityCheckPlugin and HtmlSanityCheckTask groovy files
2 Gradle expects plugin properties in META-INF
3 Property file containing the name of the actual implementation class:

8.2.2. Passing Parameters From Buildfile to Plugin

To be done

8.2.3. Building the Plugin

The plugin code itself is built with Gradle.

8.2.4. Uploading to Public Archives

8.2.5. Further Information on Creating Gradle Plugins

Although writing plugins is described in the Gradle user guide, a clearly explained sample is given in a Code4Reference tutorial.

8.3. Flexible Checking Algorithms

HtmlSC uses the template method pattern to enable flexible checking algorithms:

The Template Method defines a skeleton of an algorithm in an operation, and defers some steps to subclasses.
— http://sourcemaking.com/design_patterns/template_method

We achieve that by defining the skeleton of the checking algorithm in one operation, deferring the specific checking algorithm steps to subclasses.

The invariant steps are implemented in the abstract base class, while the variant checking algorithms have to be provided by the subclasses.

    /**
     * Template method for performing a single type of check
     * on the given HtmlPage.
     *
     * Prerequisite: pageToCheck has been successfully parsed,
     * prior to constructing this Checker instance.
     */
    public CheckingResultsCollector performCheck() {
        // assert prerequisite
        assert pageToCheck != null
        initResults()
        return check() // execute the actual checking algorithm
    }
Figure 8. Template-Method Overview
Table 16. Template Method
Component Description

Checker

abstract base class, containing the template method performCheck(), which calls the abstract method check() that every subclass implements

ImageFileExistChecker

checks if referenced local image files exist

BrokenCrossReferencesChecker

checks if cross references (links referenced within the page) exist

DuplicateIdChecker

checks if any id has multiple definitions

MissingAltTagsForImagesChecker

(planned) checks if there are image tags without alt-attributes.
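
The division of labour between the base class and its subclasses can be sketched in Java as follows. This is an illustration under simplifying assumptions, not the HtmlSC source: only the method names performCheck() and check() come from the text, the class names are hypothetical, the parsed page is reduced to a list of raw image tags, and findings are plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the template method pattern. Only performCheck() and check()
// are names from the text; everything else is hypothetical. The parsed
// page is reduced to a list of raw image tags.
abstract class CheckerSketch {
    protected final List<String> imageTags;
    protected final List<String> findings = new ArrayList<>();

    CheckerSketch(List<String> imageTags) {
        this.imageTags = imageTags;
    }

    // template method: the invariant steps, identical for every checker
    public final List<String> performCheck() {
        if (imageTags == null) {
            throw new IllegalStateException("page has not been parsed");
        }
        findings.clear();   // corresponds to initResults()
        check();            // the variant step
        return new ArrayList<>(findings);
    }

    // variant step: supplied by each concrete checker
    protected abstract void check();
}

final class MissingAltAttributeCheckerSketch extends CheckerSketch {
    MissingAltAttributeCheckerSketch(List<String> imageTags) {
        super(imageTags);
    }

    @Override
    protected void check() {
        for (String tag : imageTags) {
            // naive substring test, good enough for a sketch
            if (!tag.contains(" alt=")) {
                findings.add("missing alt attribute: " + tag);
            }
        }
    }
}
```

Because performCheck() is final, a new checker can only change what is checked, not the order of the surrounding steps.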

8.3.1. MissingImageFilesChecker

Addresses requirement R-1.

The (little) problem with checking images is their path: Consider the following HTML fragment (from the file testme.html):

<img src="./images/one-image.jpg">

This image file ("one-image.jpg") has to be located relative to the directory containing the corresponding HTML file.

Therefore the expected absolute path of the "one-image.jpg" has to be determined from the absolute path of the html file under test.

We check for existing files using the usual Java api, but have to do some directory arithmetic to get the absolutePathToImageFile:

  File f = new File( absolutePathToImageFile );
  boolean imageFileExists = f.exists() && !f.isDirectory();
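
The directory arithmetic mentioned above could look like the following Java sketch. It assumes the src attribute holds a plain relative file path (no URL encoding, no absolute URL); the class and method names are hypothetical and use java.nio.file instead of java.io.File.

```java
import java.nio.file.Files;
import java.nio.file.Path;

final class ImagePathSketch {

    // Resolves the src attribute of an image tag against the directory
    // that contains the HTML file under test.
    static Path absolutePathToImageFile(Path htmlFile, String imageSrc) {
        Path htmlDirectory = htmlFile.toAbsolutePath().getParent();
        return htmlDirectory.resolve(imageSrc).normalize();
    }

    // Same condition as in the snippet above: the path must exist
    // and must not be a directory.
    static boolean imageFileExists(Path htmlFile, String imageSrc) {
        Path image = absolutePathToImageFile(htmlFile, imageSrc);
        return Files.exists(image) && !Files.isDirectory(image);
    }
}
```

For the fragment above, imageFileExists(pathToTestmeHtml, "./images/one-image.jpg") looks for images/one-image.jpg next to testme.html, regardless of the current working directory.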

8.3.2. BrokenCrossReferencesChecker

Addresses requirement R-2.

Cross references are document-internal links where the href="link-target" from the HTML anchor tag has no prefix like http, https, ftp, telnet, mailto, file and such.

Only links with prefix # shall be taken into account, e.g. <a href="#internalLink">.
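
A minimal Java sketch of this rule is shown below. It assumes the hrefs and ids have already been extracted from the parsed page and are passed in as plain collections; the class and method names are hypothetical. A bare "#" (which browsers treat as the top of the document) is deliberately not reported, which is a choice of this sketch rather than something the text prescribes.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CrossReferenceSketch {

    // Only hrefs that start with '#' count as cross references.
    static boolean isCrossReference(String href) {
        return href != null && href.startsWith("#");
    }

    // Returns every cross reference whose target id is not defined.
    static List<String> brokenCrossReferences(Collection<String> hrefs,
                                              Collection<String> ids) {
        Set<String> definedTargets = new HashSet<>(ids);
        List<String> broken = new ArrayList<>();
        for (String href : hrefs) {
            if (!isCrossReference(href)) {
                continue; // external link or local resource: not our concern
            }
            String target = href.substring(1);
            // a bare "#" is skipped in this sketch
            if (!target.isEmpty() && !definedTargets.contains(target)) {
                broken.add(href);
            }
        }
        return broken;
    }
}
```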

8.3.3. DuplicateLinkTargetsChecker

Addresses requirement R-4.

Sections, especially headings, can be made link targets by adding the id="xyz" attribute, yielding HTML headings like those in the following example.

Problems occur if the same link target is defined several times (also shown below).

  <h2 id="seealso">First Heading</h2>
  <h2 id="seealso">Second Heading</h2>
  <a href="#seealso">Duplicate definition - where shall I go now?</a>
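
Detecting such duplicates boils down to finding values that occur more than once. The following Java sketch assumes the ids have already been collected from the page in document order; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DuplicateIdSketch {

    // Returns every id that is defined more than once, each reported once,
    // in the order in which the first repetition was encountered.
    static Set<String> duplicateIds(List<String> idsInDocumentOrder) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String id : idsInDocumentOrder) {
            // Set.add returns false if the id was already present
            if (!seen.add(id)) {
                duplicates.add(id);
            }
        }
        return duplicates;
    }
}
```

For the fragment above, the ids are "seealso" and "seealso", so the sketch reports "seealso" as a duplicate link target.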

8.3.4. MissingLocalResourcesChecker

Addresses requirement R-3.

8.3.5. BrokenExternalLinksChecker

Addresses requirement Opt-2.

Problems here are networking issues, latency and HTTP return codes. This checker is planned, but currently not implemented.

8.4. Encapsulate HTML Parsing

We encapsulate the third-party HTML parser (http://jsoup.org) in simple wrapper classes with interfaces specific to our different checking algorithms.
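
The idea of a checker-facing wrapper can be sketched in Java as below. The interface HtmlPageSketch and its method names are hypothetical. To keep the example free of dependencies, the implementation uses naive regular expressions as a stand-in for the jsoup-backed wrapper; it only handles double-quoted attributes and is not a robust HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical checker-facing interface: checkers depend on this,
// never on the parser library itself.
interface HtmlPageSketch {
    List<String> imageSources();   // src values of all image tags
    List<String> linkTargets();    // href values of all anchor tags
    List<String> ids();            // all id values, in document order
}

// Naive stand-in implementation (NOT jsoup): double-quoted attributes only.
final class RegexHtmlPageSketch implements HtmlPageSketch {
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern A_HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern ID = Pattern.compile(
            "<[a-z][^>]*?\\sid\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    private final String html;

    RegexHtmlPageSketch(String html) {
        this.html = html;
    }

    @Override public List<String> imageSources() { return allMatches(IMG_SRC); }

    @Override public List<String> linkTargets() { return allMatches(A_HREF); }

    @Override public List<String> ids() { return allMatches(ID); }

    // collects the first capture group of every match
    private List<String> allMatches(Pattern pattern) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }
}
```

With such an interface in place, swapping the parser only requires a new implementation of the wrapper; the checkers stay untouched.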

8.5. Flexible Reporting

HtmlSC allows for different output formats:

  • formats (HTML and text) and

  • destinations (file and console)

The reporting subsystem uses the template method pattern to allow different output formats (e.g. Console and HTML). The overall structure of reports is always the same.

Graphical clients can use the API of the reporting subsystem to display reports in arbitrary formats.

The (generic and abstract) reporting is implemented in the abstract Reporter class as follows:

/**
 * main entry point for reporting - to be called when a report is requested
 * Uses template-method to delegate concrete implementations to subclasses
 */
    public void reportFindings() {
        initReport()            (1)
        reportOverallSummary()  (2)
        reportAllPages()        (3)
        closeReport()           (4)
    }

    private void reportAllPages() {
        pageResults.each { pageResult ->
            reportPageSummary( pageResult ) (5)
            pageResult.singleCheckResults.each { resultForOneCheck ->
               reportSingleCheckSummary( resultForOneCheck )  (6)
               reportSingleCheckDetails( resultForOneCheck )  (7)
            }
            reportPageFooter()                                (8)
        }
    }
1 initialize the report, e.g. create and open the file, copy css-, javascript and image files.
2 create the overall summary, with the overall success percentage and a list of all checked pages with their success rate.
3 iterate over all pages
4 write report footer - in HTML report also create back-to-top-link
5 for a single page, report the nr of checks and problems plus the success rate
6 for every singleCheck on that page, report a summary and
7 all detailed findings for a singleCheck.
8 for every checked page, create a footer, pagebreak or similar to graphically distinguish pages from each other.
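
The same structure can be written as a self-contained Java sketch with one concrete output format. This is an illustration, not the HtmlSC Reporter: the class names are hypothetical, the page results are reduced to a map from page name to finding texts, and the single-check level is dropped for brevity.

```java
import java.util.List;
import java.util.Map;

// Sketch of the reporting template method. The page results are reduced
// to a map from page name to finding texts (a hypothetical simplification
// that drops the single-check level).
abstract class ReporterSketch {
    protected final Map<String, List<String>> findingsPerPage;

    ReporterSketch(Map<String, List<String>> findingsPerPage) {
        this.findingsPerPage = findingsPerPage;
    }

    // template method: fixed report structure, variable output format
    public final void reportFindings() {
        initReport();
        reportOverallSummary();
        for (Map.Entry<String, List<String>> page : findingsPerPage.entrySet()) {
            reportPageSummary(page.getKey(), page.getValue().size());
            for (String finding : page.getValue()) {
                reportFinding(finding);
            }
            reportPageFooter();
        }
        closeReport();
    }

    protected int nrOfFindingsOnAllPages() {
        int total = 0;
        for (List<String> findings : findingsPerPage.values()) {
            total += findings.size();
        }
        return total;
    }

    protected abstract void initReport();
    protected abstract void reportOverallSummary();
    protected abstract void reportPageSummary(String pageName, int nrOfFindings);
    protected abstract void reportFinding(String finding);
    protected abstract void reportPageFooter();
    protected abstract void closeReport();
}

// one concrete format: plain text collected in memory
final class PlainTextReporterSketch extends ReporterSketch {
    private final StringBuilder out = new StringBuilder();

    PlainTextReporterSketch(Map<String, List<String>> findingsPerPage) {
        super(findingsPerPage);
    }

    @Override protected void initReport() { out.append("HtmlSC report\n"); }

    @Override protected void reportOverallSummary() {
        out.append("pages checked: ").append(findingsPerPage.size())
           .append(", findings: ").append(nrOfFindingsOnAllPages()).append('\n');
    }

    @Override protected void reportPageSummary(String pageName, int nrOfFindings) {
        out.append("page ").append(pageName).append(": ")
           .append(nrOfFindings).append(" finding(s)\n");
    }

    @Override protected void reportFinding(String finding) {
        out.append("  - ").append(finding).append('\n');
    }

    @Override protected void reportPageFooter() { out.append("----\n"); }

    @Override protected void closeReport() { out.append("end of report\n"); }

    String getReport() { return out.toString(); }
}
```

An HTML reporter would implement the same six steps with markup instead of plain text; the order of the steps stays fixed in the base class.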

8.6. Styling the Reporting Output

  • The HtmlReporter explicitly generates css classes together with the html elements, based upon css styling re-used from the Gradle JUnit plugin.

  • Stylesheets, a minimized version of the jQuery JavaScript library plus some icons are copied at report-generation time from the jar-file to the report output directory.

  • Styling the back-to-top arrow/button is done as a combination of JavaScript plus some css styling, as described in http://www.webtipblog.com/adding-scroll-top-button-website/.

8.6.1. Attributions

9. Design Decisions

In the current revision we won’t check external links. These checks have been postponed to later versions.

9.2. HTML Parsing with jsoup

To check HTML we parse it into an internal (DOM-like) representation. For this task we use the jsoup HTML parser, an open-source parser without external dependencies.

To quote from the jsoup website:

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
Goals of this decision

Check HTML programmatically by using an existing API that provides access and finder methods to the DOM-tree of the file(s) to be checked.

Decision Criteria
  • few dependencies, so the HtmlSC binary stays as small as possible.

  • accessor and finder methods to find images, links and link-targets within the DOM tree.

Alternatives
  • HTTPUnit: a testing framework for web applications and websites. Its main focus is web testing and it suffers from a large number of dependencies.

  • jsoup: a plain HTML parser without any dependencies (!) and a rich api to access all HTML elements in DOM-like syntax.

Find details on how HtmlSC implements HTML parsing in the HTML encapsulation concept.

9.3. String Similarity Checking with Jaro-Winkler-Distance

The small java-string-similarity library (by Ralph Allen Rice) contains implementations of several similarity-calculation algorithms. As it is not available as a public binary, we use the sources instead, primarily:

net.ricecode.similarity.JaroWinklerStrategyTest
net.ricecode.similarity.JaroWinklerStrategy

The actual implementation of the similarity comparison has been postponed to a later release of HtmlSC.
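
To illustrate what such a comparison could offer, here is an independent Java sketch of the standard Jaro-Winkler similarity (prefix scale 0.1, common prefix capped at four characters). It is not the code of the java-string-similarity library, and the class name as well as the bestSuggestion helper are hypothetical; the helper shows how a misspelled link target might be matched against the ids that are actually defined.

```java
import java.util.List;

final class JaroWinklerSketch {

    // Jaro similarity in [0, 1]; 1.0 means identical strings.
    static double jaro(String s1, String s2) {
        if (s1.isEmpty() && s2.isEmpty()) return 1.0;
        if (s1.isEmpty() || s2.isEmpty()) return 0.0;
        int window = Math.max(0, Math.max(s1.length(), s2.length()) / 2 - 1);
        boolean[] matched1 = new boolean[s1.length()];
        boolean[] matched2 = new boolean[s2.length()];
        int matches = 0;
        for (int i = 0; i < s1.length(); i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(s2.length() - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // matched characters that appear in a different order
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (!matched1[i]) continue;
            while (!matched2[j]) j++;
            if (s1.charAt(i) != s2.charAt(j)) outOfOrder++;
            j++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / s1.length() + m / s2.length() + (m - transpositions) / m) / 3.0;
    }

    // Jaro-Winkler: boosts the Jaro score for a common prefix (max. 4 chars).
    static double jaroWinkler(String s1, String s2) {
        double jaro = jaro(s1, s2);
        int maxPrefix = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < maxPrefix && s1.charAt(prefix) == s2.charAt(prefix)) {
            prefix++;
        }
        return jaro + prefix * 0.1 * (1.0 - jaro);
    }

    // Hypothetical helper: the defined target most similar to a broken one,
    // or null if there are no candidates.
    static String bestSuggestion(String brokenTarget, List<String> definedTargets) {
        String best = null;
        double bestScore = -1.0;
        for (String candidate : definedTargets) {
            double score = jaroWinkler(brokenTarget, candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }
}
```

A finding for the broken link "#seealos" could then carry the suggestion returned by bestSuggestion("seealos", definedIds).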

10. Glossary

See the domain model for explanations of important terms.