Glossary

This is a glossary of terminology as used in the intelligent dictionary app. It combines operational terms used within the dictionary code, general linguistic terms, and terminology used by specific approaches to describing certain languages.

Terms

analysis

also, linguistic analysis or linguistic breakdown.

An ordered set of the lemma and morphosyntactic features that can describe an inflected wordform.

It minimally consists of:

  • at least one lemma

  • at least one feature, stating the wordform’s word class

Example

One possible linguistic analysis of the wordform “sabía” in Spanish is:

saber+V+Past+1Sg

In other words, the breakdown is:

  • It’s a form of saber (the lemma)

  • It’s a verb

  • It’s past-tense

  • Its actor is first-person, singular

Describes

  • 1 wordform; note, a single wordform can have multiple distinct analyses.
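A minimal sketch of how an analysis string like the one in the example above could be split into its lemma and tags. The names `Analysis` and `parse_analysis` are hypothetical, and the sketch assumes the simple `lemma+Tag+Tag` shape shown in the example, with no prefix tags:

```python
from typing import NamedTuple, Tuple


class Analysis(NamedTuple):
    """Hypothetical container for a lemma plus its feature tags."""
    lemma: str
    tags: Tuple[str, ...]


def parse_analysis(text: str) -> Analysis:
    """Split a 'lemma+Tag+Tag' string into a lemma and its tags.

    Assumes the lemma comes first and contains no '+' itself.
    """
    lemma, *tags = text.split("+")
    if not lemma or not tags:
        # An analysis minimally needs a lemma and a word-class feature.
        raise ValueError("an analysis needs a lemma and at least one tag")
    return Analysis(lemma, tuple(tags))


analysis = parse_analysis("saber+V+Past+1Sg")
print(analysis.lemma)  # saber
print(analysis.tags)   # ('V', 'Past', '1Sg')
```

Because one wordform can have several analyses, a fuller version would likely return a list of such tuples per wordform.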

conjugation

A type of inflectional category for the verb word class.

conjugator

(informal) a tool that generates a paradigm.

N.B.: people ask for a conjugator even when they want to generate noun wordforms!

declension

A type of inflectional category for the noun word class.

definition

One of possibly several meanings of the head.

derivational breakdown

A derivational breakdown of a wordform lists the morphemes that make up the wordform.

Example

  • atahkw + is + iw is the derivational breakdown of acâhkosiwiw

  • star + let + ify is the derivational breakdown of the coined English word starletify (to make something a little star)

derivational paradigm

The collection of all possible derived forms belonging to a lemma.

derived form

A new wordform created from a lemma; this new wordform has a separate lemma with its own inflectional paradigm. A derived wordform can belong to a different word class than the original source stem.

dictionary

???

dictionary entry

The main content of a dictionary. Consists of the head (in one or more orthographical representations), the word class, and the definitions.

dictionary source

An edited repository of dictionary entries. A dictionary source has at least one of the following:

  • an editor/editors

  • an author/authors

A dictionary source provides one or more dictionary entries.

A dictionary source may have other bibliographic metadata, like a book or a publication.

indeclinable particle

(In Plains Cree linguistics) The word class of terms that do not inflect. Often abbreviated as Ipc.

inflectional category

A more detailed categorization of a word class. Things that belong to the same inflectional category have the same affix set.

Examples

  • NI-1

  • VTA-n

  • NDA-4w

⚠️ A deprecated synonym exists — This was formerly also called an inflectional class, but that term is now deprecated.

inflectional paradigm

The collection of inflected wordforms belonging to a lemma. Informally known as the conjugations.

general word class

Superclass of word class. Does not contain inflectional categories.

General word classes are not detailed enough to tell you how their members inflect. A word class, on the other hand, tells you enough to be able to inflect.

Consists of

In Plains Cree

  • Noun — contains the word classes: NI, NA, NID, NAD

  • Verb — contains the word classes: VII, VAI, VTI, VTA

  • Indeclinable particle

gloss

Note: use translation instead!

Sometimes a sloppy synonym for translation. More specifically, a gloss is a one-to-one mapping between one language and another, often accompanied by relevant tags for morphosyntactic features. Glosses are more specific and less “fluent” than a translation.

inflected form

???

lacuna

“Gaps” in a paradigm. Any form that does not exist in a paradigm. For example, the English word “pants”:

Singular    Plural
(no form)   pants

Pants doesn’t have a singular form! There’s “pant leg”, but no “*pant”. This is a lacuna: a gap in the paradigm, where you would otherwise expect a valid form.
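One way to picture a lacuna in code is a paradigm table in which a missing form is stored as `None`. This is only an illustrative sketch; the dictionary layout and the helper name `lacunae` are assumptions, not a description of how any particular dictionary stores paradigms:

```python
from typing import Dict, List, Optional

# Hypothetical paradigm for English "pants": slot name -> wordform,
# or None where the paradigm has a gap.
pants_paradigm: Dict[str, Optional[str]] = {
    "Singular": None,   # the lacuna: there is no "*pant"
    "Plural": "pants",
}


def lacunae(paradigm: Dict[str, Optional[str]]) -> List[str]:
    """Return the names of the slots that have no wordform."""
    return [slot for slot, form in paradigm.items() if form is None]


print(lacunae(pants_paradigm))  # ['Singular']
```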

language pair

Each dictionary gives target language definitions for entries written in a specific source language.

The specific combination of source and target languages in a dictionary application is called a language pair, e.g., the language pair of Plains Cree to English for itwêwina.

In filenames and throughout the morphodict code, the abbreviation sssttt is used to distinguish between different dictionary applications, where sss and ttt are the three-letter ISO 639-3 codes of the dictionary’s source language and target language, respectively.

For example, for the morphodict Plains Cree-to-English dictionary this is crkeng, and many code and data file paths contain the string crkeng. One such file is the test dictionary at the path ../src/crkeng/resources/dictionary/crkeng_test_db.importjson
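The sssttt convention can be sketched as a small helper. The function name `language_pair_code` is hypothetical, and the check only verifies the shape of the codes (three letters), not that they are registered ISO 639-3 codes:

```python
def language_pair_code(source: str, target: str) -> str:
    """Join two 3-letter language codes into an 'sssttt' abbreviation."""
    for code in (source, target):
        if len(code) != 3 or not code.isalpha():
            raise ValueError(f"expected a 3-letter language code, got {code!r}")
    return source.lower() + target.lower()


print(language_pair_code("crk", "eng"))  # crkeng
```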

lemma

The base form of a word form; this is a form chosen to depict the basic representation of the paradigm. Often the least structurally and semantically marked form. Unlike a stem or root, a lemma is always a valid word form.

In a dictionary, the definitions of a lemma implicitly provide definitions for the inflected forms of the lemma.

If a term is defined in a dictionary, its head will usually be the lemma. For example, you might not find a definition for “smartphones” in a dictionary of contemporary English; instead, you’ll find a definition for “smartphone” (the lemma), and “smartphones” is one of its inflected forms. However, non-lemma wordforms may also be heads in a dictionary, depending on context.

Whether non-lemma wordforms can have their own definitions is sometimes a point of controversy among linguists. Some would argue that providing a distinct definition for a non-lemma wordform implies that it is its own lexeme. But the counter-argument is that specific wordforms in a lexeme can have their own connotations, especially in morphologically complex languages, and not all of these connotations are necessarily distinct enough to create an entirely new lexeme.

morphodict does support having definitions for non-lemma wordforms.

lexeme

A related set of wordforms.

Other sources may also call this a lexical entry or lexical item.

meaning

???

morpheme

An indivisible part of language with meaning; a morpheme cannot be broken down into smaller parts without changing its meaning.

morphosyntactic feature

???

multicharacter symbol

In LEXC, a symbol in the FST’s alphabet that is realized in text form as multiple Unicode characters. These are used for tags, e.g., +V, +TA, +Err/Orth; and special symbols used in phonological rules, e.g., the t2 in nit2<nipa>n.

Note to FST implementors: since tags are always multicharacter symbols, if the FST output has all the symbols separated, then there is no need to parse the analysis to find tags.

For example, “nêpât” is transduced to the following ten symbols (separated by |):

IC+ | n | i | p | â | w | +V | +AI | +Cnj | +3Sg
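The note above can be sketched as follows: if the symbols arrive already separated, tags can be picked out by shape alone, with no further parsing. The function name `split_symbols` is hypothetical, and the sketch assumes that every multicharacter symbol in the output is a tag (prefix tags end with "+", other tags start with "+") and that everything else is a character of the lemma:

```python
from typing import List, Tuple


def split_symbols(symbols: List[str]) -> Tuple[List[str], str, List[str]]:
    """Separate prefix tags, lemma characters, and suffix tags."""
    prefix_tags: List[str] = []
    suffix_tags: List[str] = []
    lemma_chars: List[str] = []
    for symbol in symbols:
        if len(symbol) > 1 and symbol.endswith("+"):
            prefix_tags.append(symbol)      # e.g. IC+
        elif len(symbol) > 1 and symbol.startswith("+"):
            suffix_tags.append(symbol)      # e.g. +V, +AI
        else:
            lemma_chars.append(symbol)      # an ordinary character
    return prefix_tags, "".join(lemma_chars), suffix_tags


symbols = "IC+ | n | i | p | â | w | +V | +AI | +Cnj | +3Sg".split(" | ")
print(split_symbols(symbols))
# (['IC+'], 'nipâw', ['+V', '+AI', '+Cnj', '+3Sg'])
```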

normatize

Write things according to the orthographical norm. A norm is implicitly and unconsciously decided by a community of writers. To normatize the spelling of something is to make it match the spelling expected by a community. A language may have many norms.

See also: standardize

  • e.g., the normative form of “alot” is “a lot”

  • e.g., the normative form of “icecream” is “ice cream”

  • e.g., the normative form of “atchakosuk” is “acâhkosak”

orthographical representation

???

paradigm layout

A formal specification that describes how to arrange (in a table) the inflections or derived wordforms of any lexemes belonging to a particular word class; or, how to arrange related wordforms in a table.

Subtypes:

dynamic paradigm layout
  A paradigm layout that has placeholders for the lemma or other morphosyntactic information that may be replaced when generating a rendered paradigm. These are the types of paradigm layouts used when describing an entire word class.

static paradigm layout
  A paradigm layout in which all forms are explicitly specified; there are no placeholders.
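To illustrate the difference, here is a rough sketch of filling in a dynamic layout. Everything in it is an assumption made for illustration: the `${lemma}` placeholder syntax, the row structure, the name `fill_layout`, and the `+Sg`/`+Pl` tags are not taken from any actual layout format:

```python
from string import Template
from typing import List

# Hypothetical dynamic layout: each row is a label plus an analysis template.
# "${lemma}" is the placeholder; the tags are illustrative only.
DYNAMIC_LAYOUT = [
    ["Singular", "${lemma}+N+Sg"],
    ["Plural", "${lemma}+N+Pl"],
]


def fill_layout(layout: List[List[str]], lemma: str) -> List[List[str]]:
    """Replace the lemma placeholder in every cell of the layout."""
    return [
        [Template(cell).substitute(lemma=lemma) for cell in row]
        for row in layout
    ]


for label, analysis in fill_layout(DYNAMIC_LAYOUT, "atim"):
    print(f"{label}: {analysis}")
# Singular: atim+N+Sg
# Plural: atim+N+Pl
```

A static layout, by contrast, would list the finished forms directly and need no substitution step.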

part of speech

⚠️ Deprecated — use word class instead.

The grammatical category to which a term belongs. Different parts of speech have different functions in a clause.

phrase

Multiple word forms that, together, have one meaning. A dictionary entry may use a phrase as a head.

root

The smallest form of a term (a morpheme) on which all inflected forms are based. The root might not be a valid wordform.

For example, in English, childr- is the root of child and children.

In Plains Cree

  • *atimw- is the root of the lemma atim; however, it is not a valid wordform on its own. It can be inflected to create atim and atimwak.

  • mow- is the root of the lemma mowêw, and it also happens to be a valid inflected form of mowêw (an imperative form)

source language

In a unidirectional bilingual dictionary, the language of the head words.

Example: in Cree: Words, which gives a list of Cree head words with all definitions being English translations, the source language is Cree.

See also: target language.

standardize

Write things according to the orthographical standard. A standard is explicitly and consciously decided by an individual or body to be adopted by a greater community. A language may have many standards, or it might have no standard orthography. When there is one widely-adopted standard, then it is also the norm: then “standardize” and “normatize” are synonymous.

See also: normatize

tag

A multicharacter symbol that represents a linguistic feature.

In Plains Cree

In the Plains Cree FST, these tags either end with a + sign for prefixes (e.g., PV/e+), or start with a + sign for everything else (e.g., +N, +TA, +V).

  • General word class: +V, +N, +Ipc, +Prop

  • Word class: +TA, +TI, +VI, +I, +A

  • Whether a noun is dependent: +D

  • Tense: +Prs, +Fut, +Prt (really, denotes which tense preverb exists)

  • Order: +Ind, +Cnj

  • Subject: +1Sg, +3Pl, +4Sg/Pl, +5Sg/Pl

  • Object: +1SgO, +3PlO, +4Sg/PlO

  • The possessor of a noun: +Px1Sg, +Px2Sg, +Px4Sg

  • Preverbs: PV/e+, PV/kaa+

  • Reduplication: RdplW+, RdplS+

  • and many more!

See this document for more info: https://giellalt.uit.no/lang/crk/crk.html

target language

In a unidirectional bilingual dictionary, the language of the definitions.

Example: in Cree: Words, which gives a list of Cree head words with all definitions being English translations, the target language is English.

See also: source language.

translation

A definition written in a different language than the head it is defining.

user query

also query, search string.

How the user writes their search intent, as a series of Unicode code points. This might be a messy, misspelled, strangely written string. It is the job of the intelligent dictionary to take this wild thing and make sense of it, returning results that satisfy the user’s search intent.
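As a very small sketch of a first tidying step for such a string, the function below applies Unicode normalization, collapses whitespace, and lowercases. The name `clean_query` and the choice of steps are assumptions for illustration; making sense of misspellings requires considerably more than this:

```python
import unicodedata


def clean_query(raw: str) -> str:
    """Tidy a raw query: NFC-normalize, collapse whitespace, lowercase."""
    text = unicodedata.normalize("NFC", raw)  # compose e.g. a + U+0302 into â
    text = " ".join(text.split())             # trim and collapse whitespace
    return text.lower()


# The input spells â as "a" followed by a combining circumflex (U+0302).
print(clean_query("  Aca\u0302hkos  "))  # acâhkos
```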

word class

Category of a set of terms that inflect in a similar way. Members of the same word class behave morphologically in a similar way to each other.

Contains

in Plains Cree

These are the word classes in Plains Cree:

  • NA: 🧑🏽 — animate noun

  • NI: 📘 — inanimate noun

  • NAD: 👤🧑🏽 — dependent animate noun

  • NID: 👤📘 — dependent inanimate noun

  • VII: 📘➡️ — intransitive inanimate verb

  • VAI: 🧑🏽➡️ — intransitive animate verb

  • VTI: 🧑🏽➡️📘— transitive inanimate verb

  • VTA: 🧑🏽➡️🧑🏽 — transitive animate verb

  • Ipc

More specific categorizations inside a word class are inflectional categories such as NI-1.

⚠️ A deprecated synonym exists — This was formerly also called a specific word class, but that term is now deprecated.

wordform

In linguistics, one of the different forms in which a word can exist in a language. (Not to be confused with lemma, which is its own special type of wordform.) A wordform must be able to exist by itself. Contrast this to morpheme and phrase.

stem

In linguistics, please use the term root instead.

In natural language processing and information retrieval, the stem is a potentially garbled form of the input term that aids in indexing a large number of related terms. Typically this involves using naïve heuristics to remove both inflectional and derivational affixes from the input term. The stem does not have to be linguistically meaningful, and the stem is often not a valid wordform.

For example, “connection” can be stemmed to “connect” using the Porter stemming algorithm.

Naïve stemming heuristics can be replaced with a linguistic analyzer that is able to return the lemma of a term. However, such an analyzer is not available for every language, and it may not be necessary for creating a satisfactory information retrieval system.
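To make the idea of a naïve heuristic concrete, here is a toy suffix stripper. It is deliberately not the Porter algorithm; the suffix list, the minimum stem length, and the name `naive_stem` are all made up for illustration:

```python
# Toy suffix list, checked in order (longest first). NOT the Porter algorithm.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def naive_stem(term: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


for word in ("connection", "connections", "connected", "connecting"):
    print(word, "->", naive_stem(word))
# connection -> connect
# connections -> connect
# connected -> connect
# connecting -> connect
```

All four terms index under the same stem, which is the point of stemming. The same rules also turn "pants" into "pant", showing that a stem need not be a valid wordform.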

term

???