1. Introduction and Goals

HtmlSC shall support authors who create digital formats that contain hyperlinks and integrate images and similar resources.

1.1. Requirements Overview

The overall goal of HtmlSC is to create neat and clear reports that show errors within HTML files, as illustrated in the adjoining figure.

[Figure: sample HtmlSC report]

1.1.1. Basic Usage

  1. A user configures the location (directory and filename) of one or more HTML files,

  2. and the corresponding images directory.

  3. HtmlSC performs various checks on the HTML and

  4. reports its results either on the console or as an HTML report.

HtmlSC can run from the command line or as a Gradle plugin.

Terminology: What Can Go Wrong in HTML Files?

Apart from purely syntactical errors, many things can go wrong in HTML, especially with respect to hyperlinks, anchors and ids, as those are often maintained manually.

Primary sources of problems are bad links (in technical terms: URIs). For further information, see the background information on URIs.

Broken cross references: Cross-references (internal links) can be broken, e.g. due to missing or misspelled link targets.

See BrokenCrossReferencesChecker.

Missing image files: Referenced image files can be missing or misspelled.

See MissingImageFilesChecker.

Missing local resources: Referenced local resources (other than images) can be missing or misspelled.

See MissingLocalResourcesChecker.

Duplicate link targets: Link targets can occur several times with the same name, so the browser cannot know which one is the intended target.

See DuplicateIdChecker.

Broken external links: External HTTP links can be broken for many reasons: a misspelled URL, a link target that is currently offline, or illegal link syntax.

See BrokenHttpLinksChecker.

Missing alt attribute in image tags: Image tags can lack an alt attribute.

See MissingImgAltAttributeChecker.

Checking and reporting these errors and flaws is the central business requirement of HtmlSC.
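To make two of these problem categories concrete, the following Java sketch detects duplicate ids and internal links without a matching target. It is a minimal illustration under stated assumptions, not HtmlSC's implementation: the class `LinkTargetSketch` and its method names are hypothetical, and the regular expressions only recognize double-quoted `id` and `href` attributes, where a real checker would use a proper HTML parser.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of HtmlSC.
final class LinkTargetSketch {

    // id="..." attributes (double-quoted only; skips names such as data-id).
    private static final Pattern ID =
            Pattern.compile("(?<![\\w-])id=\"([^\"]*)\"");
    // href="#..." attributes, i.e. links into the same document.
    private static final Pattern INTERNAL_HREF =
            Pattern.compile("(?<![\\w-])href=\"#([^\"]*)\"");

    // Collects the first capture group of every match, in document order.
    static List<String> allMatches(Pattern pattern, String html) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    // Ids that are defined more than once.
    static Set<String> duplicateIds(String html) {
        Set<String> seen = new LinkedHashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String id : allMatches(ID, html)) {
            if (!seen.add(id)) {
                duplicates.add(id);
            }
        }
        return duplicates;
    }

    // Internal link targets (href="#x") for which no id="x" is defined.
    static Set<String> brokenCrossReferences(String html) {
        Set<String> defined = new LinkedHashSet<>(allMatches(ID, html));
        Set<String> broken = new LinkedHashSet<>(allMatches(INTERNAL_HREF, html));
        broken.removeAll(defined);
        return broken;
    }

    public static void main(String[] args) {
        String html = "<h1 id=\"intro\">Intro</h1><h2 id=\"intro\">Again</h2>"
                + "<a href=\"#intro\">ok</a><a href=\"#no-such-target\">broken</a>";
        System.out.println("duplicate ids: " + duplicateIds(html));
        System.out.println("broken cross references: " + brokenCrossReferences(html));
    }
}
```

Both methods return sets, so a target that is referenced or duplicated several times is reported once.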

Important terms (domain terms) of HTML sanity checking are documented in a (small) domain model.

1.1.2. General Functionality

Table 1. General Requirements
| ID | Functionality | Description |
|  | read HTML file | HtmlSC shall read a single (configurable) HTML file. |
|  |  | HtmlSC can be run as a Gradle plugin. |
|  | command line usage | HtmlSC can be called from the command line with arguments and options. |
|  | configurable output | Output can be configured to go to the console or to a file. |
|  | free and open source | All required dependencies shall be compliant with the CC-SA-4 licence. |
|  | available via public repositories | For example Bintray or JCenter. |
|  | configurable to check multiple HTML files | Configure a set of files to be processed in a single run and produce a joint report (useful e.g. for API documentation with many HTML files referencing each other). |

1.1.3. Types of Sanity Checks

Table 2. Required Checks
| ID | Check | Description |
|  | missing image files | Check for all image tags whether the referenced image files exist. See [MissingImageFilesChecker]. |
|  | broken internal links | Check for all internal links from anchor tags (href="#XYZ") whether the link targets "XYZ" are defined. See [BrokenCrossReferencesChecker]. |
|  | missing local files | Check whether referenced local files (other HTML files, PDFs or similar) exist. See [MissingLocalResourcesChecker]. |
|  | duplicate link targets | Check for all bookmark definitions (… id="XYZ") whether the ids ("XYZ") are unique. See [DuplicateIdChecker]. |
|  | malformed links | Check all links for syntactical correctness. |
|  | missing alt-attribute | Check image tags for a missing alt attribute. See [MissingImgAltAttributeChecker]. |
|  |  | Check for files in image directories that are not referenced by any of the HTML files in this run. |
|  | illegal link targets | Check for malformed or illegal anchors (link targets). |
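Two of the required checks, missing image files and missing alt attributes, can be sketched in a few lines of Java. This is an illustration only, not the code of MissingImageFilesChecker or MissingImgAltAttributeChecker: the class `ImageCheckSketch` and its methods are hypothetical, references containing a colon (http:, https:, data:) are assumed to be external and skipped, and the regular expressions handle only double-quoted attributes.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of HtmlSC.
final class ImageCheckSketch {

    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern SRC =
            Pattern.compile("(?<![\\w-])src=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern ALT =
            Pattern.compile("(?<![\\w-])alt=\"[^\"]*\"", Pattern.CASE_INSENSITIVE);

    // All <img ...> tags in document order.
    static List<String> imgTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = IMG_TAG.matcher(html);
        while (matcher.find()) {
            tags.add(matcher.group());
        }
        return tags;
    }

    // Local image references that do not exist below baseDir.
    static List<String> missingImageFiles(String html, Path baseDir) {
        List<String> missing = new ArrayList<>();
        for (String tag : imgTags(html)) {
            Matcher src = SRC.matcher(tag);
            if (!src.find()) {
                continue; // no src attribute at all
            }
            String reference = src.group(1);
            if (reference.contains(":")) {
                continue; // assumption: anything with a scheme is external
            }
            if (!Files.exists(baseDir.resolve(reference))) {
                missing.add(reference);
            }
        }
        return missing;
    }

    // Image tags without an alt attribute; an empty alt="" counts as present.
    static List<String> tagsWithoutAlt(String html) {
        List<String> result = new ArrayList<>();
        for (String tag : imgTags(html)) {
            if (!ALT.matcher(tag).find()) {
                result.add(tag);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<img src=\"images/logo.png\" alt=\"Logo\">"
                + "<img src=\"images/chart.png\">";
        Path baseDir = Path.of(args.length > 0 ? args[0] : ".");
        System.out.println("missing image files: " + missingImageFiles(html, baseDir));
        System.out.println("img tags without alt: " + tagsWithoutAlt(html));
    }
}
```

The sketch only reads the file system and never writes to it, in line with the goal that checked content is never altered.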

Table 3. Optional Checks
| ID | Check | Description |
|  | missing external images | Check externally referenced images for availability. |
|  | broken external links | Check external links for both syntax and availability. |

1.1.4. Reporting and Output Requirements

Table 4. Reporting Requirements
| ID | Requirement | Description |
|  | various output formats | Checking output is available in plain text and HTML. |
|  | output to stdout | HtmlSC can output results on stdout (the console). |
|  | configurable file output | HtmlSC can store results in files in configurable directories. |
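The reporting requirements separate two concerns: the output format (plain text or HTML) and the destination (stdout or a file in a configurable directory). The Java sketch below illustrates that separation. It is not HtmlSC's reporting code: `ReportSketch`, `Finding`, and all method names are hypothetical, and the report layout is invented for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch, not part of HtmlSC.
final class ReportSketch {

    // One finding: the check that produced it and what was found.
    static final class Finding {
        final String check;
        final String message;

        Finding(String check, String message) {
            this.check = check;
            this.message = message;
        }
    }

    static String asPlainText(List<Finding> findings) {
        StringBuilder sb = new StringBuilder();
        sb.append(findings.size()).append(" finding(s)\n");
        for (Finding f : findings) {
            sb.append("- ").append(f.check).append(": ").append(f.message).append('\n');
        }
        return sb.toString();
    }

    static String asHtml(List<Finding> findings) {
        StringBuilder sb = new StringBuilder("<ul>\n");
        for (Finding f : findings) {
            sb.append("  <li>").append(escape(f.check)).append(": ")
              .append(escape(f.message)).append("</li>\n");
        }
        return sb.append("</ul>\n").toString();
    }

    // Escapes the characters that would otherwise be read as markup.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Prints to stdout when target is null, otherwise writes the file
    // and creates missing parent directories first.
    static void emit(String report, Path target) throws IOException {
        if (target == null) {
            System.out.print(report);
            return;
        }
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        Files.writeString(target, report);
    }

    public static void main(String[] args) throws IOException {
        List<Finding> findings = List.of(
                new Finding("missing image files", "images/gone.png"),
                new Finding("broken internal links", "#no-such-target"));
        emit(asPlainText(findings), null);
        if (args.length > 0) {
            emit(asHtml(findings), Path.of(args[0]));
        }
    }
}
```

Because both formatters work from the same list of findings, each report reflects the checking results exactly, whichever format or destination is chosen.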

1.2. Quality Goals

Table 5. Quality-Goals
| Priority | Quality-Goal | Scenario |
|  |  | Every broken internal link (cross reference) is found. |
|  |  | Every missing local image is found. |
|  |  | Multiple checking algorithms, report formats and clients. At least Gradle, command-line and a graphical client have to be supported. |
|  |  | Content of the files to be checked is never altered. |
|  |  | Correctness of every checker is automatically tested for positive AND negative cases. |
|  |  | Every reporting format is tested: Reports must exactly reflect checking results. |
|  |  | Checking a 100 kB HTML file takes less than 10 seconds (excluding Gradle startup). |

1.3. Stakeholder

Table 6. Stakeholder
| Role | Description | Goal, Intention |
| Documentation author | writes documentation with HTML output | wants to check that the resulting document contains good links and image references |
| arc42 user | uses the arc42 template for architecture documentation | wants a small but practical example of how to apply arc42 |
| aim42 contributor | contributes to the aim42 method guide | wants to check generated HTML code to ensure links and images are correct during the (Gradle-based) build process |
| software developer |  | wants an example of pragmatic architecture documentation and arc42 usage |

1.4. Background Information on URIs

The generic structure of a Uniform Resource Identifier consists of the following parts: [type][://][subdomain][domain][port][path][file][query][hash]

An example, visualized:

[Figure: generic URI example]

The java.net.URL class contains a generic parser for URLs and URIs. See the following snippet, taken from the unit test class URLUtilTest.groovy:

Generic URI Structure
    public void testGenericURISyntax() {
        // based upon an example from the Oracle(tm) Java tutorial:
        // http://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html
        def aURL =
                new URL("http://example.com:42/docs/tutorial/index.html?name=aim42#INTRO");
        // each assertion checks one component of the parsed URL
        aURL.with {
            assert getProtocol() == "http"
            assert getAuthority() == "example.com:42"
            assert getHost() == "example.com"
            assert getPort() == 42
            assert getPath() == "/docs/tutorial/index.html"
            assert getQuery() == "name=aim42"
            assert getRef() == "INTRO"
        }
    }

URIs are used to reference other resources. For HtmlSC it is useful to distinguish between internal (== local) and external references:

  • Internal references, a.k.a. Cross-References

  • External references
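The distinction between internal and external references can be sketched with `java.net.URI`, which parses relative references as well as full URIs. The classification rules below are assumptions made for this illustration, not HtmlSC's actual logic: a reference with a scheme or an authority counts as external, except that the `file` scheme is treated as local, and anything the parser rejects is reported as malformed. The class `ReferenceKindSketch` and its enum are hypothetical names.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch, not part of HtmlSC.
final class ReferenceKindSketch {

    enum Kind { INTERNAL, EXTERNAL, MALFORMED }

    static Kind classify(String reference) {
        final URI uri;
        try {
            uri = new URI(reference);
        } catch (URISyntaxException e) {
            // e.g. an unescaped space inside the reference
            return Kind.MALFORMED;
        }
        String scheme = uri.getScheme();
        if (scheme != null) {
            // Assumption: file: URIs point to local resources.
            return scheme.equalsIgnoreCase("file") ? Kind.INTERNAL : Kind.EXTERNAL;
        }
        // Scheme-less but with an authority, e.g. "//example.com/logo.png".
        if (uri.getRawAuthority() != null) {
            return Kind.EXTERNAL;
        }
        // Fragment-only ("#intro") and relative paths ("images/logo.png").
        return Kind.INTERNAL;
    }

    public static void main(String[] args) {
        String[] samples = {
            "#INTRO",
            "images/logo.png",
            "http://example.com:42/docs/tutorial/index.html?name=aim42#INTRO",
            "//example.com/logo.png",
            "http://exa mple.com"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + classify(sample));
        }
    }
}
```

A checker could use such a classification to route internal references to the cross-reference and local-resource checks, and external ones to the optional availability checks.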

1.4.1. Intra-Document URIs

a file…​ ref can be an internal link, or a URI without protocol…​

1.4.2. References on URIs and HTML Syntax