Paradigm layouts

This document describes morphodict’s paradigm layout package.

In the context of this document, a paradigm is a table of related wordforms. The exact nature of how the wordforms are related to one another is not specified here — rather, that is for language experts to specify.

The paradigm layout package is intended to work in conjunction with an analysis and generation system, such as a finite-state transducer (FST) for inflection. This enables dynamic paradigm layouts (layouts with placeholders) to be used for a large number of lexemes with the same linguistic paradigm.

Components of the paradigm system

  • paradigm layout files (.tsv files) — specifies how to layout wordforms in one or more panes

  • the layouts directory structure — how to organize paradigm layout files on the filesystem. Subdirectories here indicate that a paradigm has multiple sizes

  • the Paradigm Manager — mediates access to all parsed paradigm layouts from the layouts directory, keeping track of the different sizes of layout for each paradigm, as well as tying dynamic layouts to a transducer to produce inflections

  • the relabelling system, which substitutes one or more tags with user-facing labels

Paradigm layout files

Paradigm layouts files describe how to arrange wordforms and labels in a table format.

Layouts are tab-separated values (TSV) files, where each cell specifies its role in the presented paradigm.

Paradigm layouts are:

  • a series of panes — a sub-table of related wordforms

  • each pane consists of a series of rows, which is either a header or a content row

  • each content row consists of a series of cells

  • a cell can either be a row label, a column label, or a wordform cell.

For example, this is a paradigm layout file for Swahili personal pronouns (Derived from: https://en.wikipedia.org/wiki/Swahili_grammar#Personal_pronouns):

| Class	| Ind	| Comb | Suffix	| Gen | Suffix	| All
_ 1 _ Sg	mimi	-mi	_angu	--
_ 2 _ Sg	wewe	-we	-ako	--
_ 3 _ Sg	yeye	-ye	-ake	--
_ 1 _ Pl	sisi	-si	-etu	sisi
_ 2 _ Pl	nyinyi	-nyi	-enu	nyote
_ 3 _ Pl	wao	-o	-ao	wote

This example is a static paradigm layout file with one pane. All but the first row have row labels, with two tags each. Each column has a column label, with some having one tag, and some having two tags. There are a three missing forms under the All column.

Syntax

A paradigm layout file is a tab-separated values (TSV) file. A TSV file is a series of rows. Each row is separated by a U+000A line feed character. Note that this U+000A line feed is not considered to be a part of the row in this specification.

The TSV file SHOULD have one trailing U+000A line feed character after the last row.

The TSV file MUST be encoded in the UTF-8 character encoding scheme.

Each row is a series of cells. Cells are separated from each other by one U+0009 CHARACTER TABULATION character (horizontal tab character). Let C be the number of cells in a row. Let T be the number of tab separators in a row. C is equal to T + 1.

Each row in the TSV file SHOULD contain the same number of tab characters (each row has the same T tab separators). This includes blank rows (see below).

This is to facilitate display and editing of the TSV files in tools that expect an equal amount of tabs per row (e.g., the GitHub previewer, VisiData, common spreadsheet software, PyCharm).

A paradigm layout file describes one or more panes, or a grouping of rows. Each pane is separated by one blank row. A blank row is line that only consists of whitespace (including the tab character). This is equivalent to a row that consists of C empty cells (see below). A blank row SHOULD consist only of zero or more tab separators.

A row can either be a:

  • header row: its first cell must be a header label, followed by whitespace.

  • content row: a row with any other kind of cells

A label is a single cell with one or more tags. Each tag in the label begins with a prefix, a space character, and then the tag proper. Each tag within a label is separated by a single space.

These are the different prefixes:

  • #: header labels: label that describes the entire pane it appears in

  • _: row labels: describes the row in the current pane

  • |: column labels: describes the column in the current pane

For example, # Past # Indicative is a header label that has two tags: Past and Indicative. The relabelling system is intended to interpret the tags to display a user-facing string.

A cell can be a label (as described above), an empty cell (consisting of zero or more whitespace characters, a missing form cell, consisting exactly of the string -- or a wordform cell. A parser MUST attempt interpreting the cell as a label cell, an empty cell, or a missing form before it may interpret the cell as a wordform cell.

A wordform cell is either a static wordform cell or a dynamic wordform cell. A dynamic wordform cell contains exactly one ${lemma} placeholder. The placeholder must be filled by an external system, and the cell may be substituted with a generated wordform. A static wordform cell does not contain a placeholder, and can be displayed to users verbatim, without any substitution. A paradigm layout is a dynamic layout when it contains one or more dynamic wordform cells. A paradigm layout is a static layout when it contains zero dynamic wordform cells. A dynamic layout MUST be filled before the layout is ready for presentation.

Aside from the presence or absence of the ${lemma} placeholder, the exact syntax of a wordform cell is defined by the implementation.

Partial Grammar

Here is a partial W3C Extended Backus-Naur Form (EBNF) grammar of the layout file specification. The syntax of WordformCell and Tag are implementation-dependent.

Layout ::= (Row NL)+

Row ::= BlankRow | HeaderRow | ContentRow

HeaderRow ::= HeaderLabel (TAB EmptyCell)*
ContentRow ::= ContentCell (TAB ContentCell)*

ContentCell ::= RowLabel
              | ColumnLabel
              | MissingForm
              | EmptyCell
              | WordformCell
BlankRow ::= EmptyCell (TAB EmptyCell)*

HeaderLabel ::= ('#' SP Tag)+
RowLabel ::= ('_' SP Tag)+
ColumnLabel ::= ('|' SP Tag)+
MissingForm ::= '-' '-'
EmptyCell ::= SP*

TAB ::= #x09
NL  ::= #x0A
SP  ::= #x20

The variant of EBNF chosen can be visualized using following tool: https://www.bottlecaps.de/rr/ui

Where to place paradigm layout files

Paradigm layout files are placed in the following directory structure template:

layouts
├── {paradigm-name}
│   ├── {size-1}.tsv
│   ├── {size-2}.tsv
│   └── {size-3}.tsv
├── {paradigm-name}.tsv
└── {paradigm-name}.tsv

Files directly within the layouts directory, can be:

  • a .tsv layout file, whose filename stem (part before the .tsv) is the paradigm name that this layout corresponds to; or

  • a subdirectory, whose filename is the paradigm name this directory corresponds to. Within this subdirectory are .tsv layout files that indicate the different size options available for this paradigm name.

This layouts directory should be placed in the dictionary-specific resources directory, e.g., src/arpeng/resources/layouts for arpeng.

Note: as of 2021-08, the Cree layouts are still in their legacy location, src/CreeDictionary/res/layouts. This is because the same layout files are used by both Plains and Woods Cree. The intention to move them to src/cr_shared/resources/layouts once code to support that is written.

How to configure paradigm sizes

The order of the paradigm sizes are configured in Django’s settings.py with the MORPHODICT_PARADIGM_SIZES key. List all named paradigm sizes in the order you wish for them to appear in this setting. For example, if you have the sizes “basic”, and “full”, and want them to appear in the order, make sure in your site’s settings.py you have the following:

MORPHODICT_PARADIGM_SIZES = ["basic", "full"]