TUTORIAL FOR EMIL VERSION 2.1

Written by Martin Wendel, ITS, Uppsala university. Martin.Wendel@its.uu.se

ANALYSIS

Introduction

Analysis is concerned with building an abstract model of the problem and an object structure for solving it. The details are stripped off, leaving a highly abstract description of how the problem can be solved. The goal is to make the problem domain understandable and to provide a framework for the design.

Components of a Message

A message has a structure that decomposes it at different levels of abstraction. The highest level of abstraction is represented as an object describing the entire message. Lower levels of abstraction consist of objects that represent the parts of the message, and so on. Each of these objects has relations to attributes, to data and to other objects. Typically a message object has relations to a header object and a body object (figure 1). Note that the object structure itself does not contain any data obtained from the message. It describes the structure of the message, using the message format, and the objects of this structure have relations to the pieces of data that may be part of the message.

    Figure 1. Rudimentary structure of a message.

The Message Object

The Message Object has a few attributes that are common to the entire message. Some of these attributes are:

The sender's address
The recipient's address
The subject
The format type, for example MIME

Typically the values of these attributes are found in the envelope and the header of the message. However, the format type of the message cannot always be unambiguously derived from the header; it may be necessary to scan the body as well.

The Header Object

The header of a message is also structured. It may consist of several lines of text conforming to a syntactical pattern described by the message format, each line having some semantic meaning and containing some data or information.

A header line consists of a pattern and some data, or arguments to the format. These arguments are contained in data objects (figure 2).

    Figure 2. The Header Object and its relations.
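As an illustration of how a header could be broken into header lines with their data, here is a minimal sketch in Python. It is not Emil's implementation. The function names unfold and parse_header_line are hypothetical, and the sketch assumes RFC822-style folding, where a line starting with a space or tab continues the previous line.

```python
def unfold(raw_header: str) -> list[str]:
    """Join continuation lines (starting with space or tab) to the previous line."""
    lines: list[str] = []
    for line in raw_header.splitlines():
        if line[:1] in (" ", "\t") and lines:
            # Continuation of the previous header line.
            lines[-1] += " " + line.strip()
        elif line:
            lines.append(line)
    return lines


def parse_header_line(line: str) -> tuple[str, str]:
    """Split one unfolded header line into its field name and its data."""
    name, _, value = line.partition(":")
    return name.strip(), value.strip()


if __name__ == "__main__":
    raw = "Subject: Hello\r\n world\r\nFrom: a@example.org\r\n"
    for header_line in unfold(raw):
        print(parse_header_line(header_line))
```

The field name plays the role of the pattern and the value the role of the data object. A real format description would match whole patterns rather than only the text before the colon.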

The Body Object

The body may also contain structured data or information. Data may have a type, in which case it is information. Data may be encoded and it may be represented in a character set. Data may also be binary, as distinct from text (see below).

RFC822 defines that the body of a message contains lines of text. Text is typically represented in a particular character set. It may be encoded and it may also contain encoded binary parts; this is a wide definition of text. MIME elaborates on this and defines a grammar for body type and encoding. Still, a MIME text is no different from a general text and can thus contain encoded binary parts.

There are two types of bodies: single part bodies and multi part bodies. This distinction is handled by a boolean attribute of the body object. Each body object has relations to body part objects. When the body object has relations to more than one body part object it is a multi part body.

The body part object has some attributes describing the body part. Among those are:

Type
Encoding
Size
Character set (for text)

In the case of a MIME message these attributes can be retrieved from the data objects in the header structure (the attributes are references to other objects). In some cases the attributes can only be retrieved from the body part data itself. The character set may also be retrieved from the configuration files.

The body part objects have relations to data objects containing the data of the body part (figure 3).

    Figure 3. The Body Object.
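The object structure described so far can be sketched as a handful of plain data classes. This is only an illustration in Python under stated assumptions: all class and attribute names are hypothetical, and the multi part flag is derived from the number of body parts instead of being stored as a separate boolean.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Data:
    """A piece of data taken from the message."""
    content: bytes = b""


@dataclass
class HeaderLine:
    name: str
    arguments: list[Data] = field(default_factory=list)


@dataclass
class Header:
    lines: list[HeaderLine] = field(default_factory=list)


@dataclass
class BodyPart:
    type: str = "text"
    encoding: Optional[str] = None
    charset: Optional[str] = None  # only meaningful for text
    data: Data = field(default_factory=Data)

    @property
    def size(self) -> int:
        return len(self.data.content)


@dataclass
class Body:
    parts: list[BodyPart] = field(default_factory=list)

    @property
    def multipart(self) -> bool:
        # More than one body part object means a multi part body.
        return len(self.parts) > 1


@dataclass
class Message:
    sender: str
    recipient: str
    subject: str = ""
    format_type: str = "RFC822"
    header: Header = field(default_factory=Header)
    body: Body = field(default_factory=Body)
```

The objects hold only relations to Data objects, so the structure stays separate from the message content, as the text requires.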

Relations within a Message

As described in the previous section, a message is a structured body of data. It is important to understand the relationships within the structure as well; understanding the structure alone is not enough.

Envelope, Format and the Message

The envelope contains information about the sender and the recipient of a message. The header of a message contains To-address and From-address but these have no direct relation to the information in the envelope. While the envelope gives information about the origin and destination of the message to the MTA (Mail Transfer Agent), the header gives similar, but not necessarily identical, information to the recipient. The relationship of the envelope is to the Message object, the message itself.

The format (for example MIME) is a description of the structure and the syntax of a message. It describes what the message should look like for it to be properly handled by the MTA and the UA (User Agent). If the message does not conform to the mutually agreed format, chances are that the transport of the message will fail, and even if transport does not fail the recipient may not be able to view the message as intended. The format describes the message but the format is not part of the message. The relationships between the format and the message are primarily to the message object, the header line object and the body part object.

The objects within the message also have relations other than those of the structure itself. One of these is the relationship between the header line data and the body part object: in a MIME message the file name and type are described in a header line for each body part. Another relation is the one between the body part object and the body part data: a BinHexed attachment contains the file name and file type within the attachment.

Another relationship is between the body part data and the message object: in a MIME multi part message the body parts can be seen as message objects, since they contain headers and a body. Some body parts may contain an empty header, but this is supported by the structure. This makes MIME multi part messages very different from other message types with multiple parts. This difference needs a workaround for the structure to be suitable.

The relations described above are displayed in figure 4.

    Figure 4. Relations between objects within a message.

The Format

The previous section described the format as one object. By looking at the relationships it is quite easy to see that the format must be described in a more complex manner. Indeed, the structure of the format is not very different from the structure of the message.

The format needs to declare what kinds of header lines and what types of body part objects it supports. If dealing with encoded body part data, the format also declares what methods of encoding are available for use.

Elaborating the format yields figure 5.

    Figure 5. Including a structured format object.

Message Structure

Looking back at the basic structure of a message, it is obvious that different formats generate different structures (figure 6). The greatest difference is between the unstructured formats of 6a and 6c and the structured formats of 6b and 6d. Converting between these two kinds of structure is not easy. The problem becomes simpler if the internal representation of the unstructured formats is changed towards the representation of the structured formats (figure 7). Here the top level of the message (named 0, zero) is common to all formats, while the lower level 1 is only available for structured formats. The MIME format allows an arbitrary depth of the structure, while SUN Mailtool allows only levels 0 and 1.

    Figure 6. Basic message structures. The object names are abbreviated 
    according to: M = message object, H = header object, B = body object, 
    P = body part object, D = data object. 6a shows a single part message. 
    6b shows a message with a single part attachment in SUN Mailtool 
    format. 6c shows a multipart unstructured format message. 6d shows 
    a multipart message according to MIME and SUN Mailtool.

    Figure 7. Preparing structure levels in the message structure.

Using this representation it will be possible to structure a message similarly for different formats. An unstructured format must ignore the effects of message objects other than level 0, while a SUN Mailtool format ignores level 2 and deeper. Care must be taken while constructing the structure that the recursion in this representation is strictly controlled, avoiding unwanted loops. A message object generated because of a single body part object should be marked so that the descendant of the body part object one level below is always a data object.
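The level rule can be illustrated with a small recursive walk that stops at the depth a format supports. This is a hedged sketch in Python, not taken from Emil: the Node class, the MAX_LEVEL table and the format names used as keys are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical depth limits per format; None means arbitrary depth.
MAX_LEVEL = {"RFC822": 0, "SUN": 1, "MIME": None}


@dataclass
class Node:
    """A message object at some level of the structure."""
    name: str
    children: list["Node"] = field(default_factory=list)


def visible_levels(node: Node, fmt: str, level: int = 0) -> list[tuple[int, str]]:
    """Return (level, name) pairs for the message objects a format takes into account."""
    limit = MAX_LEVEL[fmt]
    if limit is not None and level > limit:
        # Deeper levels are ignored by this format.
        return []
    found = [(level, node.name)]
    for child in node.children:
        found.extend(visible_levels(child, fmt, level + 1))
    return found


if __name__ == "__main__":
    tree = Node("M0", [Node("M1", [Node("M2")])])
    for fmt in MAX_LEVEL:
        print(fmt, visible_levels(tree, fmt))
```

Passing the level explicitly is one simple way to keep the recursion under control. A real implementation would also need the marking of generated message objects described above.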

Boundaries and other delimiters

One of the main problems of parsing a message is identifying the delimiters of the body parts. The end delimiter of a header is easy to find: a CRLF on a line of its own. The end delimiter of a text is somewhat harder to find.

If the body parts of an unstructured message are divided into two groups, parts of type text and parts of type encoded data, the classification becomes:

Text - This is the default type, thus the start condition is met when the start condition for an encoding is not. The end is met when there is a start of an encoding or at end of data (this should work in a data buffer as well, therefore end of file is not generally applicable).
Encodings - An encoding is, as far as this document is concerned, a BinHexed or uuencoded body of data. The start conditions for these are the fairly unique starting strings used by these encodings. The end condition is also an encoding specific issue. However, the start condition is only met when the body of the encoding conforms with the syntax of the encoding format. This makes three stages that must be correct (start, body and end) for the start condition to be fulfilled.
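The three-stage check for an encoding can be sketched for uuencoded data. The following Python example is an illustration under stated assumptions, not Emil's parser: the names BEGIN, uu_line_ok and find_uuencoded are hypothetical, and the body check only verifies that each line is long enough for the byte count given by its first character.

```python
import re

# Start stage: "begin", a three or four digit octal mode and a file name.
BEGIN = re.compile(r"^begin [0-7]{3,4} \S+")


def uu_line_ok(line: str) -> bool:
    """Body stage: the line must match the length given by its first character."""
    if not line:
        return False
    n = (ord(line[0]) - 32) & 0x3F   # number of decoded bytes on this line
    expected = 4 * ((n + 2) // 3)    # encoded characters needed for n bytes
    return len(line) - 1 >= expected and all(32 <= ord(c) <= 96 for c in line)


def find_uuencoded(lines: list[str], start: int = 0):
    """Return (begin_index, end_index) of the first valid uuencoded part, or None."""
    for i in range(start, len(lines)):
        if not BEGIN.match(lines[i]):
            continue
        j = i + 1
        while j < len(lines) and lines[j] != "end" and uu_line_ok(lines[j]):
            j += 1
        # End stage: an "end" line must follow at least one valid body line.
        if j < len(lines) and lines[j] == "end" and j > i + 1:
            return i, j
    return None


if __name__ == "__main__":
    text = ["Hello", "begin 644 cat.txt", "#0V%T", "`", "end", "Bye"]
    print(find_uuencoded(text))
```

A line that merely looks like a start string, such as ordinary text beginning with the word begin, is rejected because the following lines fail the body stage. Everything outside the returned range is then treated as text.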

There are two more methods of decomposing a message:

Unique boundaries - This is the method used by MIME. A unique boundary string acts as delimiter of the body parts. If the uniqueness of the boundary is not fulfilled, the message is corrupt. Because of its nesting capabilities MIME is also provided with a unique boundary as end condition both of the body part and of the multi part structure.
Size specification - This is used by SUN Mailtool, together with a non-unique boundary string. A size specification is really only an end condition; SUN Mailtool uses a boundary as start condition.
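Decomposition by unique boundary can be sketched as follows. This Python example is a simplified illustration, not Emil's code: the function name split_on_boundary is hypothetical, and the sketch assumes the MIME convention that a part starts at a line consisting of two hyphens plus the boundary string, and that the same line with two trailing hyphens closes the multi part structure.

```python
def split_on_boundary(body: str, boundary: str) -> list[str]:
    """Split a multi part body into its parts using a unique boundary string."""
    delimiter = "--" + boundary
    closing = delimiter + "--"
    parts: list[str] = []
    current = None  # None until the first delimiter has been seen
    for line in body.splitlines():
        stripped = line.rstrip()
        if stripped == closing:
            if current is not None:
                parts.append("\n".join(current))
            return parts
        if stripped == delimiter:
            if current is not None:
                parts.append("\n".join(current))
            current = []
        elif current is not None:
            current.append(line)
    # Without the closing boundary the end condition is never met.
    raise ValueError("closing boundary not found")


if __name__ == "__main__":
    body = "preamble\n--XYZ\nfirst\n--XYZ\nsecond\n--XYZ--\nepilogue\n"
    print(split_on_boundary(body, "XYZ"))
```

Text before the first delimiter and after the closing delimiter is ignored here. Nested multi part bodies would call the same function again on a part, using that part's own boundary.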

Format recognition

When a message comes in, Emil must be able to recognize the structure and format information to use when parsing it. Other information that can be of great importance is the default character set used by the sender. MIME and SUN Mailtool formats are specified in the message itself by specific header lines. When the message is not in either of those formats, a default character set for the sender must be used.
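A minimal sketch of format recognition from the header, written in Python. It rests on assumptions that should be checked against the actual format definitions: a MIME message is taken to carry a MIME-Version header line, a SUN Mailtool message is assumed to be marked by a Content-Type of X-sun-attachment, and the table DEFAULT_CHARSETS with its single entry is purely hypothetical.

```python
from typing import Optional

# Hypothetical per-sender defaults, standing in for a lookup in a database.
DEFAULT_CHARSETS = {"alice@example.org": "ISO-8859-1"}


def recognize_format(headers: dict[str, str]) -> str:
    """Guess the message format from already parsed header lines."""
    lowered = {name.lower(): value.strip().lower() for name, value in headers.items()}
    if "mime-version" in lowered:
        return "MIME"
    # Assumption: SUN Mailtool marks attachments with this content type.
    if lowered.get("content-type", "").startswith("x-sun-attachment"):
        return "SUN"
    return "RFC822"


def charset_for(sender: str, fmt: str, fallback: str = "US-ASCII") -> Optional[str]:
    """Return the sender's default charset, or None when the message specifies it."""
    if fmt in ("MIME", "SUN"):
        return None
    return DEFAULT_CHARSETS.get(sender, fallback)


if __name__ == "__main__":
    fmt = recognize_format({"Subject": "hi"})
    print(fmt, charset_for("alice@example.org", fmt))
```

As noted earlier, the header alone does not always settle the format, so a fuller implementation would fall back to scanning the body.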

Functional Model

The differences between input and output, although at the same level of abstraction, are too great to span in a single step. Emil uses a multipass design to accomplish the transformation. This makes it possible to divide the problem of transformation into several smaller problems that are easier to grasp.

Decomposing the overall function of Emil into clearly defined functions yields:

Load and parse message (tag objects with corresponding format name)
1. Get sender and recipient (these are specified as arguments)
2. Get target format (lookup recipient in the target database)
3. Get sender's default charset (lookup sender in the target database)
4. Load message (into a single character buffer)
5. Parse header (unfold and structure into tokens)
6. Try formats (compare header with the patterns defined in the formats)
7. Parse body (look for boundaries and other delimiters)
  - Assign body part descriptors, boundaries and method of decomposition
  - In case of a MIME multipart body return to 5 until end delimiter is found
  - In case of a SUN Mailtool multipart body return to 5 until end of buffer
  - Structure body into body part objects
Apply conversion of data objects
1. Compare object descriptors with target format
2. Convert non-conforming encodings
  - Decode into a common representation
  - Encode into the specified target encoding
Add target format
1. Add target headers
2. Add target method of decomposition and boundaries
Output message (use only objects tagged with the target format name)

This is a rough model of the functions performed by Emil.
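The conversion step, decoding into a common representation and encoding into the target encoding, can be sketched with Python's standard library. This is an illustration only: the function convert and the two tables are hypothetical, raw bytes serve as the common representation, and base64 and quoted-printable are used simply because the standard library provides them.

```python
import base64
import quopri

# Each decoder turns encoded data into raw bytes (the common representation).
DECODERS = {
    "7bit": lambda data: data,
    "base64": base64.b64decode,
    "quoted-printable": quopri.decodestring,
}

# Each encoder turns raw bytes into the named encoding.
ENCODERS = {
    "7bit": lambda data: data,
    "base64": base64.b64encode,
    "quoted-printable": quopri.encodestring,
}


def convert(data: bytes, source: str, target: str) -> bytes:
    """Convert data between encodings by way of the common representation."""
    if source == target:
        # Already conforming to the target format, nothing to do.
        return data
    raw = DECODERS[source](data)
    return ENCODERS[target](raw)


if __name__ == "__main__":
    print(convert(b"aGVsbG8=", "base64", "7bit"))
```

Going through one common representation means that n encodings need n decoders and n encoders instead of a converter for every pair.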

March 1996

ITS Uppsala university
Box 887
751 08 Uppsala
SWEDEN

Martin Wendel