Mozilla Archive Format Add-on - MAFF Specification

The MAFF specification

MAFF is meant to be a simple format for archiving a copy of some web content in a single file.

Saving web pages

MAFF can be used by a web browser as a destination file format when saving a web page, and all the other resources required for rendering it, to a local file system for later reference.

Even though the same operation could be done by saving multiple files with relative references to each other, in practice this approach is not robust. When moving the saved files in the same file system, care should be taken not to change their relative locations, lest the content become unusable. Moreover, when files are moved across file systems that support different naming restrictions, like when authoring a CD-ROM, automatic renames performed on the individual files may make the archived page unusable, and recovering at a later time can be difficult.

Collections of MAFF files, instead, can be safely organized using the operating system's file manager, and moved across different file systems with no risk of making the content unusable. No additional software besides the file manager is required in order to organize the collection.

For this use case, MAFF is comparable to MHTML, even though the two formats are very different in other regards.

Tracking the source of archived content

The original URL the content was saved from, and the local date and time of the save operation, are the essential information required to locate the source of the saved content. This information can be used for later reference.

Packaging related content together

MAFF files can also be authored originally, instead of being saved by a web browser. In this case, they can be used to package related resources that are not meant to be severed.

Because MAFF files can be easily inspected and edited with existing tools, this kind of independent authoring is straightforward.

Other considerations

In MAFF files, dynamic features like JavaScript are available, even though they can be expected to work properly only as long as they do not depend on external resources.

The MAFF file format is not meant to be a complete offline cache, or a file format that can be used to replay client-server transactions. Pages saved inside MAFF archives should be treated as an atomic unit. For example, when fetching the content from its original location for updating an archive, user agents should not mix portions of the saved page and portions of the updated content.

It should be possible to render the contents of a MAFF file consistently after it has been extracted to a local file system and metadata has been removed.

Design goals

Easy to use and implement

The specification of the file format must allow very simple implementations. For complex details, implementations should be able to rely on available libraries.

Based on existing and widely used technologies

When considering solutions of similar complexity, in order to reach a particular goal, existing web standards and widely used specifications should be preferred to a custom implementation.

Custom solutions may be appropriate if they are much easier to implement than an existing specification, and the benefits of the existing specification do not apply to the particular use case.

Conformance levels

In order to satisfy the requirement of simplicity of implementation, and to encourage early adoption of the format, different conformance levels are available to implementors.

Elementary

Read-enabled implementations at this level can display the contents of most of the MAFF files generated by implementations at the elementary or basic level, and almost all of the MAFF files generated by implementations at the normal level, but may not always be able to access the metadata.

Write-enabled implementations at this level do not store metadata, or store it in a way that does not guarantee that conforming readers will be able to read it. Implementations at the elementary or basic level will be able to display the contents of most of the MAFF files generated by implementations at this level, while implementations at the normal level will be able to display the contents of almost all of the generated MAFF files.

Basic

Read-enabled implementations at this level can display the contents of most of the MAFF files generated by implementations at the elementary or basic level, and almost all of the MAFF files generated by implementations at the normal level. Implementations at this level can access the metadata of most of the MAFF files generated by implementations at the basic level, and almost all of the MAFF files generated by implementations at the normal level.

Write-enabled implementations at this level may store metadata. Implementations at the elementary or basic level will be able to display the contents of most of the MAFF files generated by implementations at this level, while implementations at the normal level will be able to display the contents of almost all of the generated MAFF files. Implementations at the basic level will be able to access the metadata of most of the MAFF files generated by implementations at this level, while implementations at the normal level will be able to access the metadata of almost all of the generated MAFF files.

Normal

This level is not yet formalized, and no conforming implementations can exist at this time. Implementations at this level should take into account most of the possible interoperability issues. Implementations that aim at reaching this level should be updated to reflect the evolution of the stabilized aspects of this specification.

Read-enabled implementations at this level can display the contents of almost all of the MAFF files generated by implementations at the elementary or basic level, and virtually all of the MAFF files generated by implementations at the normal level. Implementations at this level can access the metadata of almost all of the MAFF files generated by implementations at the basic level, and virtually all of the MAFF files generated by implementations at the normal level.

Write-enabled implementations at this level may store metadata. Implementations at the elementary or basic level will be able to display the contents of almost all of the MAFF files generated by implementations at this level, while implementations at the normal level will be able to display the contents of virtually all of the generated MAFF files. Implementations at the basic level will be able to access the metadata of almost all of the MAFF files generated by implementations at this level, while implementations at the normal level will be able to access the metadata of virtually all of the generated MAFF files.

Custom or extended

This level applies to ideas and requirements that are not part of the base specification, even though they are related to the file format and are under discussion. Some of these requirements may be outside of the scope of the base specification, and should be considered extensions, while others may be considered for inclusion in the basic or normal conformance levels.

Since none of the custom or extended requirements are meant to be stable, implementors may need to change their implementations substantially if conflicting requirements are introduced in the basic or normal conformance levels.

Definitions

Page: An atomic unit of related archived content. Multiple independent pages can be stored in a single MAFF archive.
Main document: The top-level file of a page, which is generally the file displayed in a web browser window.

File extension and type

MAFF files should be saved using the .maff file extension (lowercase is recommended), even on systems where file extensions are not normally used to identify file types. [Conformance level: elementary]

Implementations should treat files with the .maff file extension (case insensitive) as MAFF files, even on systems where file extensions are not normally used to identify file types. [Conformance level: basic]
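
As an illustration of the case-insensitive extension check described above, here is a minimal Python sketch. The function name has_maff_extension is hypothetical and is not part of this specification.

```python
from pathlib import PurePath


def has_maff_extension(filename: str) -> bool:
    """Return True if the last extension of the name is .maff, ignoring case."""
    # PurePath.suffix only returns the final extension, so "page.maff.zip"
    # is not treated as a MAFF file by this sketch.
    return PurePath(filename).suffix.lower() == ".maff"
```

A reader could apply this test to the file name before attempting to open the file as a ZIP archive.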

The MIME type application/x-maff is suggested for files with the .maff extension. [Information]

ZIP implementation

The ZIP implementation must be based on PKWARE's ZIP Application Note. [Conformance level: elementary]

File and directory names must be stored using UTF-8. [Conformance level: basic]

Directory structure

The root directory of the archive must be empty. [Conformance level: elementary]

One first-level directory must be present for every saved page. At least one page must be present in the archive. No other first-level directories may be present. [Conformance level: elementary]

Every first-level directory should contain a file named index.rdf, with the metadata. [Conformance level: basic]

If the index.rdf file is not present, the main document must be stored in a file named index, with a file extension based on the content type. If the content type of the main document is HTML, the file must be named index.html. [Conformance level: elementary]

If the index.rdf file is present, the metadata must contain the name of the file containing the main document. This file must be located in the same folder as the index.rdf file. This file must be named index, with a file extension based on the content type, unless the file type is RDF. [Conformance level: basic]
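
The directory rules above can be sketched in Python with the standard zipfile module. This is only an illustration under stated assumptions: the helper names list_pages and find_main_document are hypothetical, and the sketch does not read index.rdf, so it only locates the main document when index.rdf is absent.

```python
import zipfile
from typing import Optional


def list_pages(archive) -> dict:
    """Map each first-level directory to the names stored directly inside it."""
    pages: dict = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            parts = name.split("/")
            if len(parts) == 1:
                # The root directory of the archive must not contain files.
                raise ValueError(f"file in archive root: {name!r}")
            entries = pages.setdefault(parts[0], [])
            # Keep only files directly inside the page directory; skip the
            # directory entry itself ("page/") and anything nested deeper.
            if len(parts) == 2 and parts[1]:
                entries.append(parts[1])
    if not pages:
        raise ValueError("archive contains no pages")
    return pages


def find_main_document(entries: list) -> Optional[str]:
    """Without index.rdf, return the file named 'index' plus one extension."""
    if "index.rdf" in entries:
        return None  # the metadata names the main document instead
    candidates = [e for e in entries if e.rpartition(".")[0] == "index"]
    return candidates[0] if candidates else None
```

Calling list_pages on an archive and then find_main_document on each page's entries gives a reader at the elementary level enough information to display a page.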

Matching file extensions with MIME media types

Assignment of MIME type to individual files is implemented by ensuring that the file names of supported content match a list of well-known file extensions. [Conformance level: normal]

The well-known file extensions, with their related media types, are as follows:

audio/ogg = .oga
audio/x-wav = .wav
application/ogg = .ogg
application/x-javascript, application/ecmascript, application/javascript, text/ecmascript, text/javascript = .js
application/xhtml+xml = .xhtml, .xht
image/gif = .gif
image/png = .png
image/jpeg = .jpg, .jpeg
image/svg+xml = .svg
text/css = .css
text/html = .html, .htm
text/xml = .xml
video/ogg = .ogv

When storing files that don't have a well-known media type, the use of any file extension is acceptable. Generally, in this case implementors should use the extension to type mapping provided by the operating system. In case this association is not available, the file extension of the original file may be used. [Conformance level: normal]
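
The table of well-known extensions can be expressed as a simple lookup. The Python sketch below is an illustration only: the names WELL_KNOWN_TYPES and media_type_for are hypothetical, the choice of application/javascript as the single type returned for .js is an assumption (the list above gives several equivalent types), and so is the case-insensitive comparison.

```python
from pathlib import PurePath
from typing import Optional

# Extension-to-media-type table taken from the list above.
# Assumption: .js maps to one representative type out of the several listed.
WELL_KNOWN_TYPES = {
    ".oga": "audio/ogg",
    ".wav": "audio/x-wav",
    ".ogg": "application/ogg",
    ".js": "application/javascript",
    ".xhtml": "application/xhtml+xml",
    ".xht": "application/xhtml+xml",
    ".gif": "image/gif",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".svg": "image/svg+xml",
    ".css": "text/css",
    ".html": "text/html",
    ".htm": "text/html",
    ".xml": "text/xml",
    ".ogv": "video/ogg",
}


def media_type_for(filename: str) -> Optional[str]:
    """Return the well-known media type for a file name, or None if unknown."""
    # Assumption: extensions are compared case-insensitively.
    return WELL_KNOWN_TYPES.get(PurePath(filename).suffix.lower())
```

When the function returns None, a reader would fall back to the operating system's own extension mapping, as described above.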

Metadata

The format used for storing metadata about the archived files in the index.rdf and history.rdf files is RDF/XML.

Some restrictions for the RDF/XML file format are still to be specified. These restrictions are required to allow for read-enabled implementations that are as simple as possible. In particular, only one of the possible XML representations of the RDF graph is valid. This is the current representation that uses the MAF XML namespace. In this way, a full RDF parser would not be required to read the metadata.

Further restrictions on the structure and format of the XML files are under consideration. For example, implementations might be required to put one tag per line, always use UTF-8 encoding, and never encode characters as entities unless necessary. This would make room for very simple implementations that don't embed an XML parser.

The following information is stored in index.rdf:

The file name of the main document.
The original URL the page was saved from.
The date and time of the save operation.
The title of the page, if present.
The character set to use when parsing files that are part of the page.
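
To show how a simple reader might extract these five fields, here is a Python sketch based on the standard ElementTree parser. Everything about the XML layout in it is an assumption made for illustration and is not defined in this section: the MAF namespace URI, the element names (indexfilename, originalurl, archivetime, title, charset), and the convention that each value is held in an RDF resource attribute. The function name read_index_rdf is also hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumed namespace URI for the MAF vocabulary; not stated in this section.
MAF_NS = "http://maf.mozdev.org/metadata/rdf#"

# Assumed element names, mapped to the five fields listed above.
FIELDS = {
    "indexfilename": "main_document",
    "originalurl": "original_url",
    "archivetime": "saved_at",
    "title": "title",
    "charset": "charset",
}


def read_index_rdf(xml_text: str) -> dict:
    """Collect the known metadata fields from an index.rdf document."""
    result = {}
    prefix = "{" + MAF_NS + "}"
    for element in ET.fromstring(xml_text).iter():
        if not element.tag.startswith(prefix):
            continue
        key = FIELDS.get(element.tag[len(prefix):])
        # Assumption: the value is stored in the RDF:resource attribute.
        value = element.get("{" + RDF_NS + "}resource")
        if key is not None and value is not None:
            result[key] = value
    return result
```

Fields that are absent from the file, such as a missing title, are simply left out of the returned dictionary.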

Date and time

The date and time of the save operation should be stored using the format described in RFC 5322 section 3.3. If this format is not used, the Mozilla JavaScript Date format must be used. Implementations must be able to parse both formats. [Conformance level: basic]
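
For the preferred RFC 5322 format, Python's standard email.utils module already provides a parser and a formatter, so a sketch can be very short. The function names below are hypothetical, and the sketch does not handle the fallback Mozilla JavaScript Date format, which a conforming reader would also need to parse.

```python
from datetime import datetime
from email.utils import format_datetime, parsedate_to_datetime


def parse_archive_time(value: str) -> datetime:
    """Parse an RFC 5322 date-time string into a datetime object."""
    return parsedate_to_datetime(value)


def format_archive_time(moment: datetime) -> str:
    """Format a datetime as an RFC 5322 date-time string."""
    return format_datetime(moment)


def current_archive_time() -> str:
    """Return the local date and time of the save operation, with its offset."""
    return format_archive_time(datetime.now().astimezone())
```

Using an offset-aware local time keeps both the local clock reading and the information needed to convert it to another time zone later.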

Title

If the file format of the main document allows an explicit page title to be specified, like HTML does, the title from the metadata should match the title from the main document. If the title is not available, this field should be omitted. [Conformance level: basic]

This metadata field allows applications that want to display information about the contents of MAFF archives to do this without embedding an HTML parser. For example, an extension for a file manager might only use the title from the metadata, while a web browser might ignore this field and only display the title resulting from parsing the main document. [Information]

Character set

The character set declared in the MAFF archive should be used when parsing the contents of all the files that do not declare a different character set. [Conformance level: basic]
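
The fallback rule can be sketched as follows in Python. How a file declares its own character set (for example, through an HTML meta element) is outside the scope of this sketch, so the declared value is simply passed in. The function and parameter names are hypothetical.

```python
from typing import Optional


def decode_file(data: bytes, declared_charset: Optional[str], archive_charset: str) -> str:
    """Decode file contents, preferring the file's own declared character set."""
    # The archive-level character set applies only when the file itself
    # declares nothing different.
    return data.decode(declared_charset or archive_charset)
```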

Extended metadata

Inside each first-level directory of a MAFF archive, the second-level directory named ^metadata^ (case-sensitive) is reserved and should not contain actual content. A file or folder named ^metadata^ (case-insensitive) should not exist inside any first-level directory. [Conformance level: extended]

File names inside the ^metadata^ folder should be limited to a sequence of up to 20 lowercase ASCII characters or hyphens (-), followed by an optional lowercase file extension beginning with dot (.). [Conformance level: extended]

Inside the ^metadata^ folder, file names that begin with x- are reserved for custom extensions to the MAFF format. Implementors that want to store additional metadata that is not documented in this specification must store it using such reserved names, for example 12345678_123/^metadata^/x-custom-info.rdf.
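
The naming rules for the ^metadata^ folder can be checked with a regular expression, as in the Python sketch below. Several details are assumptions of this sketch rather than statements of the specification: "lowercase ASCII characters" is read as the letters a to z, the base name must contain at least one character, and the extension may contain lowercase letters and digits. The function names are hypothetical.

```python
import re

# Up to 20 lowercase letters or hyphens, then an optional ".ext".
# Assumption: the extension consists of lowercase letters and digits.
_METADATA_NAME = re.compile(r"[a-z-]{1,20}(\.[a-z0-9]+)?")


def is_valid_metadata_name(name: str) -> bool:
    """Check a file name against the naming rule for the ^metadata^ folder."""
    return _METADATA_NAME.fullmatch(name) is not None


def is_custom_extension_name(name: str) -> bool:
    """Check whether a valid name uses the reserved x- prefix."""
    return is_valid_metadata_name(name) and name.startswith("x-")
```

With these definitions, the example name x-custom-info.rdf is accepted both as a valid metadata file name and as a custom extension name.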

The provisions above are candidates for inclusion in the normal conformance level. However, at present they are considered extended and are subject to change.

Contents of extended metadata

The base specification does not provide a facility for storing the original URL from which each individual file was copied, as each page is handled as an atomic unit. This feature is under consideration as an extension, but even if this information is present, implementations must not rely on it to be able to properly display the page. [Conformance level: extended]

The base specification does not consider custom MIME or HTTP headers associated with individual files, as the MAFF format is designed to work in the same way as when a saved page is opened directly from a local file system. This feature is however under discussion as an extension. Implementations should not rely on this feature unless necessary, since it may not be available in other implementations. [Conformance level: extended]

Storing and replaying arbitrary MIME headers in an archive can be subject to security considerations. While static information about the content itself, like the Content-Type header, is generally safe to be used from every location, information about the relation of the content with other resources may not apply anymore when the resource is moved. For example, a site may be listed as allowed in a Content Security Policy header, but this trust relationship is only relevant at the time the content is generated, and should not be used later.

Test cases

Web archives demonstrating the features of the file format are available here: