The Definition File
Outline
The definition file is coded in XML syntax according to rules given
by the associated Triple-S XML DTD (Document Type Definition). The
definition file contents describe two aspects:
a. the file itself, in terms of version number, date and time of
creation etc.
b. the survey, in terms of the survey variables.
The following shows an outline of the contents of the definition
file.
<?xml version="1.0"?>
<sss version="1.2" options="option_setting">
  <date>date_text</date>
  <time>time_text</time>
  <origin>origin_text</origin>
  <user>user_text</user>
  <survey>
    <name>survey_name</name>
    <title>survey_title_text</title>
    <record ident="record_id">
      <variable ident="variable_id" type="variable_type">
        . . .
        variable_details
        . . .
      </variable>
      . . .
      <variable ident="variable_id" type="variable_type">
        . . .
        variable_details
        . . .
      </variable>
    </record>
  </survey>
</sss>
Formatting
1. Recommendations
In order to improve the readability of the definition file, it is
recommended that:
a. The file is organised into lines using CR, LF (decimal 13,
decimal 10) combinations. However, these should be avoided within
elements that contain text (e.g. <title>, <label> or <value>), where
their presence could affect how the text is processed. Note that the
<br/> element is provided in these situations to indicate where line
breaks should appear within the text.
b. At most one element, or element with associated attributes,
appears on one line.
c. Lines are indented with space or tab characters to reflect the
structure inherent in the file. An indent is applied after every
element that contains other elements.
Comments may be used to annotate contents or to temporarily hide
sections of the file from the XML parsing mechanism. These are
standard XML comments, which start with <!-- and end with -->.
Comments are optional and can appear any number of times in the
definition file.
a. <!--comment_text--> can be used anywhere (after the initial
<?xml ... ?> declaration) to indicate parts of the definition file
that are to be ignored.
b. A comment_text may include any text except two successive dash
characters, --.
For example:
<!--Data collected from 12-18th June 2002-->
XML uses the Unicode character set. As a character set, Unicode is
composed of 17 planes of up to 65,536 characters each. The first
plane (plane 00) is called the Basic Multilingual Plane and contains
all the characters that most survey definitions would need. In order
to represent such a large potential range of characters within a
file, XML uses "encoded" characters.
The default encoding is UTF-8. In UTF-8 the ASCII characters
(hexadecimal 00-7F) are represented unchanged, but all other
characters are encoded as two or more bytes. So if the definition
file contains only 7-bit ASCII characters, which include 0-9, a-z
and A-Z, then the default UTF-8 is suitable.
However, if the variable names or more likely any texts contain
8-bit characters then they must be encoded into UTF-8 or the actual
encoding must be specified. The most common encoding is
Latin-1 (or more precisely ISO-8859-1) in which case the initial XML
declaration should be:
<?xml version="1.0" encoding="ISO-8859-1"?>
More details and resources on Unicode are available from
http://www.unicode.org.
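The difference between the two encodings can be illustrated with a short Python sketch. It is only an illustration, assuming Python 3; the sample strings and the helper name byte_lengths are made up for this example and are not part of the Triple-S standard.

```python
def byte_lengths(text):
    """Return (utf8_length, latin1_length) in bytes for the given text."""
    return len(text.encode("utf-8")), len(text.encode("iso-8859-1"))

# "Q1a" is pure 7-bit ASCII; "Caf\u00e9" contains one 8-bit Latin-1 character.
for sample in ["Q1a", "Caf\u00e9"]:
    utf8_len, latin1_len = byte_lengths(sample)
    print(f"{sample!r}: UTF-8 {utf8_len} bytes, ISO-8859-1 {latin1_len} bytes")

# Latin-1 bytes that include an 8-bit character are not valid UTF-8,
# which is why the encoding must be declared when it is not UTF-8.
latin1_bytes = "Caf\u00e9".encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

ASCII-only text occupies the same bytes in both encodings, so only files containing 8-bit characters need the explicit encoding declaration.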
Definition File Elements and Attributes
This section describes the
syntax and function of each of the keywords and keyword phrases used
to form a Triple-S XML definition file. The keywords are shown in
the order they are expected in the file.
<sss version="sss_version"
[ options="option_setting" ]
[ languages="language_list" ] >
The <sss> element is always required and is used to encapsulate the
entire specification document. It contains a mandatory version
attribute and optional options and languages attributes.
The version attribute is used to indicate the version of the
Triple-S standard that applies to this specification. If the
sss_version is 1.1 then only elements and attributes from the
Triple-S XML version 1.1 standard are used. If the sss_version is
1.2 then new elements and attributes from the Triple-S XML version
1.2 standard may be present.
The
options attribute is used to indicate that
some aspect of the Triple-S definition meets a defined standard that
may reduce the amount of checking required by an import program.
The only value that option_setting may have in version 1.2 is
standardnames. To meet the definition of Triple-S standardnames,
variable names must:
a. consist of 1-8 characters
b. consist of only the characters A-Z (and a-z) and 0-9
c. start with a letter (A-Z or a-z)
d. be case insensitive (i.e. upper and lower case are equivalent)
e. be unique within the survey (e.g. Q1 and Q001 are different, q1
and Q1 are the same)
f. appear as the "name_text" without leading, trailing, or embedded
blanks
If all variable names in the
definition file conform to this definition then the standardnames
option should be set.
For example:
<sss
version="1.2" options="standardnames">
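An exporting program could decide whether to set the standardnames option with a check along the following lines. This is a minimal Python sketch of the rules listed above; the function names are hypothetical and the regular expression is one possible reading of those rules, not part of the standard.

```python
import re

# 1-8 characters, letters and digits only, starting with a letter.
STANDARD_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]{0,7}")

def is_standard_name(name):
    """True if a single variable name meets the character and length rules."""
    return STANDARD_NAME.fullmatch(name) is not None

def all_standard_names(names):
    """True if every name is valid and the names are unique ignoring case."""
    folded = [name.upper() for name in names]
    return (all(is_standard_name(name) for name in names)
            and len(set(folded)) == len(folded))
```

For example, all_standard_names(["Q1", "Q001"]) holds because the two names differ, while a list containing both "q1" and "Q1" fails the uniqueness rule.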
The languages attribute is used to indicate that there are some
multilingual texts within the Triple-S definition and to define the
language identifiers that are used for those texts. See the later
section on Multilingual Texts for more on surveys that have
multilingual texts.
For example:
<sss
version="1.2" languages="en fr">
<date>date_text</date>
Optional. The
date_text should represent the date
the file was created.
<time>time_text</time>
Optional. The
time_text
should represent the time the file was created.
<origin>origin_text</origin>
Optional. The
origin_text should describe the
originating system (program and operating system).
For example:
<origin>MyProg v3, Windows 2000</origin>
<user>user_text</user>
Optional. The
user_text should indicate the name of
the user who created the file.
<survey>
Mandatory. Introduces
details of the survey itself.
<name>survey_name</name>
Optional. There are no specific conventions for the survey_name. For
those systems with no specific survey naming convention this element
could be used to hold the filename. The survey_name is not affected
by the presence of a standardnames option in the <sss> element.
For example:
<name>SP1025</name>
<version>survey_version</version>
Optional. The content of the
survey_version is not defined, but some local conventions
will be required if it is to be used as anything more than just
descriptive.
For example:
<version>3.1</version>
<title>survey_title_text</title>
Optional. The
survey_title_text should represent the
survey title. The title may optionally include any number of
<br/>
directives to indicate new line breaks. If this is a multilingual
survey then the title may also contain language-specific texts (see
later section on Multilingual Texts).
For example:
<title>The Fitness Centre Survey<br/>First wave</title>
<record ident="record_id" [ href="uri" ] >
Mandatory. One record element starts after <survey> (or any survey
description elements if present). It is used to introduce the
definition of the variables that are held in the Data File. The
record_id is any single character A to Z or a to z.
The optional uri can be used to specify an explicit location for the
data file that is described by this Triple-S XML specification. Note
that using an href attribute ties the specification to the data file
and may cause problems if the specification and data files are
moved.
For example:
<record ident="A">
The
record_id can be used in conjunction
with the
variable_id (see the
<variable> element later) to generate a
unique variable name on import.
Then for each variable being
described there should be a block comprising:
<variable ident="variable_id" type="variable_type"
[ use="use_type" ] >
Mandatory. The
variable_id is an integer number of up
to four digits, in the range 1 to 9999, with or without leading
zeroes. Each
variable_id must be unique within the
<record> ... </record> block.
The variable_type must be one of:
single - categorical with one response allowed
multiple - categorical with any number of responses
quantity - numeric value (integer or real)
character - character value
logical - Yes/No or True/False value
The use_type is optional and describes the role of this variable in
the specification. It must be one of:
serial - this variable contains the serial number (or other
identification field) for the case. There can be at most one serial
variable and it must be of type character or quantity. The data
values must be unique and should not be missing.
weight - this variable contains a case weight. There can be at most
one weight variable and it must be of type quantity. The data values
should be non-negative and should not be missing.
For example:
<variable ident="10" type="single">
or: <variable ident="1"
type="quantity" use="serial">
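The constraints on the use attribute can be checked mechanically. The following Python sketch is one way an import program might do so; the function name, the input shape (a list of (variable_type, use_type) pairs) and the message wording are assumptions made for this illustration.

```python
# Variable types permitted for each use_type, as described above.
ALLOWED_TYPES_FOR_USE = {
    "serial": {"character", "quantity"},
    "weight": {"quantity"},
}

def check_use_types(variables):
    """Return a list of problems for (variable_type, use_type) pairs.

    use_type is None for variables without a use attribute.
    """
    problems = []
    seen = set()
    for variable_type, use_type in variables:
        if use_type is None:
            continue
        if use_type not in ALLOWED_TYPES_FOR_USE:
            problems.append(f"unknown use '{use_type}'")
            continue
        if use_type in seen:
            problems.append(f"more than one {use_type} variable")
        seen.add(use_type)
        if variable_type not in ALLOWED_TYPES_FOR_USE[use_type]:
            problems.append(
                f"{use_type} variable cannot be of type {variable_type}")
    return problems
```

An empty list means the use attributes are consistent with the rules; the check does not examine the data values themselves.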
<name>name_text</name>
Mandatory. The
name_text should represent the name
the variable had in the original survey.
For example:
<name>Q1a</name>
Note that if the names
conform to the definition of Triple-S standardnames then a
standardnames directive should appear in the
options attribute of the <sss> element.
<label>label_text</label>
Mandatory. The
label_text should represent the label
or question text of the original variable.
For example:
<label>First visited</label>
The label may optionally include any number of <br/> directives to
indicate new line breaks. If this is a multilingual survey then the
label may also contain language-specific texts (see later section on
Multilingual Texts).
<position start="start_location" [ finish="finish_location" ] />
Mandatory. Describes the
location of the data values within the data record. The
start_location and
finish_location are positive integers,
which represent the character positions, with the first position in
the data record being 1.
For example:
<position start="21" finish="24"/>
The
finish_location must be greater than
or equal to the
start_location. The
finish attribute may be omitted if the
finish_location is the same as the
start_location.
The <position> element defines the part of the data record that is
allocated to holding the value of the variable. The <size>, <values>
and <spread> elements describe which parts of the data record are to
be interpreted as the value, and what the legal values of the
variable are. As a consequence the <position> element must define an
area that is at least as large as that implied by the <size>,
<values> and <spread> elements.
The parts of the data record
defined by the
<position> elements may appear in any
order, may overlap each other, and do not have to describe the
entire data record. It is recommended that import programs ignore
all parts of the data record not defined by
<position> elements, including those
beyond the highest location defined by a
<position> element.
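Because positions are counted from 1 and the finish location is inclusive, an import program written in a language with zero-based indexing has to adjust the start location. A minimal Python sketch, in which the helper name extract_field and the sample record are hypothetical:

```python
def extract_field(record, start, finish=None):
    """Return the characters at 1-based positions start..finish inclusive.

    If finish is omitted it defaults to start, as for the finish attribute.
    """
    if finish is None:
        finish = start
    if start < 1 or finish < start:
        raise ValueError("start must be >= 1 and finish must be >= start")
    # Python slices are 0-based with an exclusive end, hence start - 1.
    return record[start - 1:finish]

# A made-up data record: serial in positions 1-4, a year in positions 21-24.
sample_record = "0001" + " " * 16 + "2024"
year = extract_field(sample_record, 21, 24)
```

Characters outside the requested positions are simply ignored, in line with the recommendation above.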
<filter>filter_name</filter>
Optional. The
filter_name must be the name (as
defined by the <name>
element) of a previously defined
logical variable. The value of this logical variable determines
if the current variable is available for that case.
For example:
<filter>EverVisited</filter>
The name of the logical
variable must be unique. All variable types, including logical
variables, can have a
<filter>
element. However, note that variables used as a
serial or
weight must have no missing values, hence
it is inappropriate for these to have a <filter>
specified.
The elements that can follow the <position> or <filter> element vary
according to the variable_type:
single - mandatory <values> element
multiple - optional <spread> element, mandatory <values> element
quantity - mandatory <values> element
character - mandatory <size> element
logical - nothing extra
<spread subfields="num_subfields" [ width="subfield_width" ] />
Optional and only used with
multiple type variables. The
<spread> element indicates that the data
values are coded as a series of category values in consecutive
subfields (rather than the default multiple format of a series of
0/1 characters).
The num_subfields attribute must be a positive integer, and denotes
the number of subfields within the overall field that is defined by
the <position> element. The subfield_width is also a positive
integer and denotes the width of each subfield. Therefore the
<position> element must define a width of at least
(num_subfields * subfield_width). The subfield_width must be large
enough to hold the largest category value specified for the
multiple.
For example:
<spread subfields="5" width="3"/>
5 subfields of width 3
The
width
attribute may be omitted if the
num_subfields exactly fills the area
defined by the
<position> element. In this case the
subfield_width is determined by
dividing the width derived from the
<position> element by
num_subfields.
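The subfield arithmetic can be sketched in Python as follows. The helper name split_subfields is hypothetical, and the sketch returns the raw subfield strings only; how blank subfields are interpreted is not covered here.

```python
def split_subfields(field, num_subfields, subfield_width=None):
    """Split the text of a <position> area into consecutive subfields.

    If subfield_width is omitted, it is derived by dividing the width of
    the field by num_subfields, which must then fill the field exactly.
    """
    if subfield_width is None:
        if len(field) % num_subfields != 0:
            raise ValueError("subfields do not exactly fill the field")
        subfield_width = len(field) // num_subfields
    if num_subfields * subfield_width > len(field):
        raise ValueError("field is too narrow for the subfields")
    return [field[i * subfield_width:(i + 1) * subfield_width]
            for i in range(num_subfields)]
```

For instance, a 15-character field with subfields="5" and width="3" yields five strings of three characters each.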
<values> ... </values>
Mandatory for single,
multiple and quantity types. The
<values> element is used to define the set
of legal values and optional text labels for values (e.g.
categorical codes).
A <values> element contains a <range> element, one or more <value>
elements, or both. If a <range> is present then it must be the first
element.
The details of the
start_value,
finish_value and
code_value depend on the type
specification.
<range from="start_value" to="finish_value" />
Optional first or only
element. The
<range> indicates an overall range of
legal values for the variable. The
finish_value must be equal to or
greater than the
start_value. This may be followed by
any number of
<value> elements each
defining a particular value.
<value code="code_value">value_text</value>
Any number of optional
elements that are used to give labels to specific values. The
value_text may optionally include any
number of
<br/> directives to indicate
new line breaks. If this is a multilingual survey then the value may
also contain language-specific texts (see later section on
Multilingual Texts).
If no
<range> element has been specified then
there must be at least one
<value> element. If a
<range> element has been specified then
the
code_value may lie within or outside
the defined
start_value and
finish_value. Apart from this, all
code_values
must be unique within each
<values> element.
For single and multiple variables:
The start_value, finish_value and code_value must all be positive
integers. The <value> elements do not need to be in any order, nor
be complete. There is no upper limit to the number of <value>
elements that may be specified within the corresponding variable
definition.
For example:
<values>
<!--3 labelled categories-->
<value code="1">Yes</value>
<value code="2">No</value>
<value code="9">Refused</value>
</values>
or:
<values>
<!--unlabelled with two explicit categories-->
<range from="1" to="99" />
<value code="98">Dont Know</value>
<value code="99">Refused</value>
</values>
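The rules for the <values> element of a single or multiple variable lend themselves to a simple automated check. The Python sketch below uses the standard library xml.etree.ElementTree module; the function name and the wording of the messages are assumptions for this illustration, not part of the standard.

```python
import re
import xml.etree.ElementTree as ET

POSITIVE_INTEGER = re.compile(r"[0-9]+")

def check_categorical_values(values_element):
    """Return a list of problems found in a <values> element."""
    problems = []
    children = list(values_element)  # XML comments are not included
    ranges = [c for c in children if c.tag == "range"]
    values = [c for c in children if c.tag == "value"]
    if not ranges and not values:
        problems.append("no <range> or <value> element")
    if ranges and children[0].tag != "range":
        problems.append("<range> is not the first element")
    codes = [v.get("code", "") for v in values]
    limits = [r.get(name, "") for r in ranges for name in ("from", "to")]
    if any(not POSITIVE_INTEGER.fullmatch(n) or int(n) < 1
           for n in codes + limits):
        problems.append("values must be positive integers")
        return problems
    if any(int(r.get("to")) < int(r.get("from")) for r in ranges):
        problems.append("finish_value is less than start_value")
    if len({int(c) for c in codes}) != len(codes):
        problems.append("duplicate code_value")
    return problems
```

Both examples above pass this check with an empty list of problems when parsed with ET.fromstring.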
For
quantity variables:
The
start_value,
finish_value and
code_value explicitly define the valid
range, and implicitly define the format and physical size of data
for the variable. The valid range for a variable of type quantity
can include positive or negative values. Negative values are
identified by a single leading minus sign, '-'. Positive values are
identified by the absence of a sign.
For example:
<values>
<!--integers from 1 to 100-->
<range from="1" to="100" />
</values>
or:
<values>
<!--0 to 500 with 2 dp, plus 1 explicit value-->
<range from="0.00" to="500.00" />
<value code="999.99">Dont Know</value>
</values>
The number of decimal places
must be the same for all values used in the values block. The number
of decimal places must be identical to the number of decimal places
used to represent the data in the corresponding data file.
Values in the definition
file must contain at least one digit. The use of a decimal point is
optional for integer values. The following table gives examples of
correct and incorrect representations:
Value | Status
1.0 | Correct
+1.0 | Incorrect - 'plus' sign not allowed
-1.0 | Correct
- 1.0 | Incorrect - contains embedded spaces
1. | Correct
.1 | Correct
-.1 | Correct
-. | Incorrect - no numeric digits present
There is no upper or lower
limit to the magnitude of the values that may be assigned to a
quantity variable.
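The representation rules for quantity values can be expressed as a regular expression. The following Python sketch is one possible encoding of the rules and table above; the pattern and the function names are illustrative assumptions rather than part of the standard.

```python
import re

# Optional leading minus, then digits with an optional decimal point.
# At least one digit is required; plus signs and spaces are rejected.
QUANTITY_VALUE = re.compile(r"-?(?:[0-9]+\.?[0-9]*|\.[0-9]+)")

def is_valid_quantity_value(text):
    """True if text is an acceptable quantity value representation."""
    return QUANTITY_VALUE.fullmatch(text) is not None

def decimal_places(text):
    """Number of digits after the decimal point (0 if there is none)."""
    return len(text.partition(".")[2])

def consistent_decimal_places(values):
    """True if all values in a values block share the same decimal places."""
    return len({decimal_places(v) for v in values}) <= 1
```

Applied to the second example above, the values 0.00, 500.00 and 999.99 are all valid and all use two decimal places.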
<size>size_specification</size>
Mandatory for character type
variables. Defines the maximum number of characters in the data for
the variable. The
size_specification must be a positive
integer; there is no defined upper limit to the
size_specification.
For example:
<size>100</size>
Finally, for all variable
types:
</variable>
Mandatory. Completes
definition of the variable.
Then either the definition
of another variable (introduced by another variable element), or:
</record>
Mandatory. Finishes the
definition for the set of variables.
</survey>
Mandatory. Finishes the
definition for the survey.
</sss>
Mandatory. Finishes the
definition file.
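As an illustration of how the elements described above fit together, the following Python sketch reads a small definition file with the standard library xml.etree.ElementTree module and lists its variables. The sample survey content and the helper name read_variables are assumptions made for this example; a real import program would also need to handle the <values>, <size>, <spread> and <filter> elements.

```python
import xml.etree.ElementTree as ET

# A small, made-up definition file following the outline given earlier.
DEFINITION = """<?xml version="1.0"?>
<sss version="1.2">
  <survey>
    <title>The Fitness Centre Survey</title>
    <record ident="A">
      <variable ident="1" type="quantity" use="serial">
        <name>Serial</name>
        <label>Serial number</label>
        <position start="1" finish="4"/>
        <values>
          <range from="1" to="9999"/>
        </values>
      </variable>
      <variable ident="2" type="single">
        <name>Q1</name>
        <label>Ever visited</label>
        <position start="5"/>
        <values>
          <value code="1">Yes</value>
          <value code="2">No</value>
        </values>
      </variable>
    </record>
  </survey>
</sss>
"""

def read_variables(xml_text):
    """Return one dictionary per <variable> in the definition file."""
    root = ET.fromstring(xml_text)
    record = root.find("survey/record")
    variables = []
    for var in record.findall("variable"):
        position = var.find("position")
        start = int(position.get("start"))
        # finish may be omitted when it equals start.
        finish = int(position.get("finish", start))
        variables.append({
            "ident": int(var.get("ident")),
            "type": var.get("type"),
            "use": var.get("use"),  # None when no use attribute is given
            "name": var.findtext("name"),
            "start": start,
            "finish": finish,
        })
    return variables

for variable in read_variables(DEFINITION):
    print(variable)
```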
Multilingual Texts
The text for a survey <title>, variable <label> or values <value>
element may contain any number of language-specific text elements.
These are in addition to the plain or default text that should
always be present for these elements.
For example:
<value code="1">Yes
<text xml:lang="en-GB">Yes</text>
<text xml:lang="en-US">Sure</text>
<text xml:lang="fr">Oui</text>
</value>
<text xml:lang="language_identifier">formatted_text</text>
Optional. The
formatted_text may optionally include any number of
<br/> directives to indicate
new line breaks.
Although there is no
restriction on
language_identifier, the intended values
of the
xml:lang attribute are described in the
official W3C XML version 1.0 specification as:-
"The values of the attribute
are language identifiers as defined by IETF (Internet Engineering
Task Force) RFC 1766, Tags for the Identification of Languages
(http://www.ietf.org/rfc/rfc1766.txt)
or its successor on the Standards Track.
Note:
IETF RFC 1766 tags are
constructed from two-letter language codes as defined by ISO 639 (Codes
for the representation of names of languages), from two-letter
country codes as defined by ISO 3166 (Codes for the
representation of countries and their subdivisions part 1 (country
codes)), or from language identifiers registered with the
Internet Assigned Numbers Authority, Register of Language Tags.
It is expected that the successor to IETF RFC 1766 will introduce
three-letter language codes for languages not presently covered by
ISO 639."
The use of multilingual texts within the specification must be
signalled by a languages="language_list" attribute in the initial
<sss> element.
For example:
<sss
version="1.2" languages="en-GB en-US fr">
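An import program might select the text for a requested language and fall back to the default text when no matching <text> element exists. The following Python sketch shows one way to do this with xml.etree.ElementTree, which exposes the xml:lang attribute under the XML namespace URI. The function names are hypothetical, the sketch assumes the identifiers in the languages attribute are separated by spaces as in the example above, and it ignores <br/> directives for simplicity.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang using the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def declared_languages(sss_element):
    """Return the language identifiers listed on the <sss> element."""
    return sss_element.get("languages", "").split()

def text_for_language(element, language):
    """Return the text for the given language, or the default text."""
    for text in element.findall("text"):
        if text.get(XML_LANG) == language:
            return "".join(text.itertext()).strip()
    return (element.text or "").strip()

VALUE = ET.fromstring(
    '<value code="1">Yes'
    '<text xml:lang="en-GB">Yes</text>'
    '<text xml:lang="en-US">Sure</text>'
    '<text xml:lang="fr">Oui</text>'
    '</value>'
)
french = text_for_language(VALUE, "fr")
fallback = text_for_language(VALUE, "de")  # no German text, so default
```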