Dictionary data
Dictionary applications require dictionary data.
importjson
morphodict requires dictionary data to be supplied in a custom format
called importjson. It looks like this:
[
  ⋮,
  {
    "analysis": [[], "nîmiw", ["+V", "+AI", "+Ind", "+3Sg"]],
    "head": "nîmiw",
    "linguistInfo": { "stem": "nîmi-" },
    "paradigm": "VAI",
    "senses": [{ "definition": "s/he dances", "sources": ["CW"] }],
    "slug": "nîmiw"
  },
  {
    "analysis": [[], "nîmiw", ["+V", "+AI", "+Ind", "+X"]],
    "formOf": "nîmiw",
    "head": "nîminâniwan",
    "senses": [
      {
        "definition": "it is a dance, a time of dancing",
        "sources": ["CW"]
      }
    ]
  },
  ⋮
]
This format is a subset of JSON that is intended to match the internal data structures and terminology of morphodict code. It is not recommended as a long-term storage format. It does not have room for everything a linguist might want in a dictionary, only for the things that morphodict currently supports. And it is unlikely to use your preferred linguistic terminology.
In practice you will want to store your canonical dictionary data in some other format, such as Daffodil, and write a short script to convert that to importjson format.
If you do not yet have all of the data available—e.g., you may have a draft dictionary that does not have paradigms or analyses assigned to entries—it is still hugely beneficial if you can provide initial dictionary data containing:
All the definitions you have available
At least one FST-analyzable entry for every paradigm
That is, when getting started, it’s better to have a full dictionary that needs revision and is sparse on detail but provides full details for at least one entry in every paradigm, than to have full details for more entries but only from one or two paradigms. And it’s much better to have some initial draft dictionary data covering the whole language, allowing people to start working with it, than to have no data at all while waiting for it to be ‘right.’
importjson formatting suggestions
To make it easier for humans to look at the raw dictionary data and to compare different versions of that data, it is strongly recommended that you run
./sssttt-manage sortimportjson FILENAME
on any importjson files before committing or sharing them.
That automatically implements the following suggestions:
Run the importjson files through prettier to format them nicely. If using the command line, you’ll need to add a --parser=json argument as prettier does not automatically recognize the .importjson file extension.
Sort the entries by slug, with any formOf entries coming immediately after the corresponding lemma entry, and related formOf entries sorted together by head.
Compare strings using NFD Unicode normalization so that accented and non-accented characters sort near each other.
Explicitly sort the keys of the emitted JSON objects. Otherwise the JSON object keys can be emitted in random or insertion order, creating unnecessary noise in diff output. In JS you can try the json-stable-sort package, and in Python the json.dump functions take an optional sort_keys=True argument.
Make sure that the JSON you are emitting has Unicode strings instead of asciified strings with human-unfriendly escape sequences. For example, in Python, the json.dump() function needs an extra ensure_ascii=False keyword argument or you will get "w\u00e2pam\u00eaw" instead of "wâpamêw".
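The suggestions above can be sketched in a few lines of Python. This is only an illustration of the ordering and serialization rules, not the code behind sortimportjson; the helper names nfd, sort_entries, and dump_importjson are made up for this example, and it uses the standard json module with indent=2 rather than prettier for layout.

```python
import json
import unicodedata


def nfd(text):
    """Return the NFD-normalized form of text, for use as a sort key."""
    return unicodedata.normalize("NFD", text)


def sort_entries(entries):
    """Order lemma entries by slug, each followed by its formOf entries by head."""
    lemmas = [e for e in entries if "formOf" not in e]
    forms_by_slug = {}
    for e in entries:
        if "formOf" in e:
            forms_by_slug.setdefault(e["formOf"], []).append(e)

    result = []
    for lemma in sorted(lemmas, key=lambda e: nfd(e["slug"])):
        result.append(lemma)
        forms = forms_by_slug.get(lemma["slug"], [])
        result.extend(sorted(forms, key=lambda e: nfd(e["head"])))

    # Assumption: formOf entries pointing at an unknown slug are kept at the end
    # rather than silently dropped, so that no data is lost while sorting.
    known = {lemma["slug"] for lemma in lemmas}
    orphans = [e for slug, fs in forms_by_slug.items() if slug not in known for e in fs]
    result.extend(sorted(orphans, key=lambda e: (nfd(e["formOf"]), nfd(e["head"]))))
    return result


def dump_importjson(entries):
    """Serialize with sorted object keys and unescaped Unicode strings."""
    return json.dumps(sort_entries(entries), sort_keys=True, ensure_ascii=False, indent=2)
```

With NFD normalization, a slug such as nîmiw sorts among the other ni… slugs instead of after every unaccented n… slug.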
importjson specification
An importjson file is a JSON file containing an array of entries.
There are two kinds of entries, normal entries and formOf entries.
Normal entries
The fields are:
The head field, a required string, is the head for the entry, in the internal orthography for the language. This is what is literally displayed as the headword in the entry. It is usually a single wordform, but could also be a multiword phrase, non-independent morpheme, stem, &c.
The senses field, a required array of {definition: string, sources: string[], coreDefinition?: string, semanticDefinition?: string} objects, contains the definitions for the entry.
Only the definition and sources fields are required.
The definition must currently be an unformatted string. We are aware that people would like to specify things such as source-language text to be shown in the current orthography, cross-references, notes, examples, and so on, in a more structured manner.
If we were starting from scratch we might call the definition field a displayDefinition, but we already have some data, and often it is the only definition field provided.
The sources are typically short abbreviations for the name of a source. sources is an array because multiple distinct sources may give the same, or essentially the same, definition for a word.
The optional coreDefinition field may specify a definition to use for auto-translation. For example, if an entry for ‘shoe’ has lots of details and notes, but when auto-translated into first person possessive it should simply become ‘my shoe’, you can specify the core definition as shoe.
The optional semanticDefinition field may specify a definition to use instead of the main definition text for the purposes of search. It will be used instead of the plain definition field for indexing keywords, and when computing definition vectors for semantic search. This is related to the concept of the core definition, but may add additional relevant keywords, while leaving out stopwords or explanatory text such as ‘see’ in ‘[see other-word]’ or the literal word ‘literally.’
The slug field, a required string, is a unique key for this entry that is used for several purposes, including user-facing URLs, to make formOf references, and for homograph disambiguation.
Each normal entry must provide a slug that is distinct from every other normal entry. It is recommended that this field be the same as the head, including diacritics, but with any URL-unsafe characters stripped, and a homograph disambiguator added at the end if needed for uniqueness.
How to create disambiguators is up to the lexicographer. For example, for the Plains Cree homograph yôtin, the lexicographer might set the slugs for the three entries to yôtin@1 and yôtin@2 and yôtin@3; or to yôtin@na and yôtin@ni and yôtin@v; or following any other scheme they desire, so long as the assignments remain relatively stable.
Any homograph disambiguator should start with @, as there is code in morphodict to redirect an attempt to access an invalid homograph disambiguator like /word/nîmiw@foo into a search for nîmiw, whereas something like nîmiw-foo will do a search for nîmiw-foo instead of nîmiw. That way if disambiguators change due to adding and removing entries/definitions, old links should still be useful to people.
The paradigm field, an optional string, is the name of the paradigm used to display the paradigm tables for the entry. This may be a static or dynamic paradigm.
This field may have a null value, or be omitted entirely, if the entry is indeclinable.
The analysis field, an optional array, is the analysis of the headword. This field is used to populate the linguistic breakdown popup shown by the blue ℹ️ icon.
The required format is that of an entry from the list returned by lookup_lemma_with_affixes, e.g., [[], "nîmiw", ["+V", "+AI", "+Ind", "+3Sg"]]. The format is:
Array of:
Array of FST tag strings
FST lemma string
Array of FST tag strings
This field may have a null value, or be omitted entirely, if the entry is unanalyzable.
The linguistInfo field, an optional arbitrary JSON object, allows extra presentation data to be stored in the database. Morphodict HTML templates have access to this data for the purpose of displaying additional data to the end user.
Morphodict does not use any of this data for its core language-independent functionality.
It is recommended not to put any unused data in here that ‘might be handy later,’ but only to add new things here when required as part of a coordinated effort with the frontend code to add new user-facing features.
The fstLemma field, an optional string, is the FST lemma to use when generating dynamic paradigm tables for unanalyzable forms in dictionaries that support that. It must not be specified when there is also an analysis field.
To be clear on the concept that we’re talking about: the FST lemma is the thing that gets plugged into a paradigm layout template to generate associated wordforms. For example, if the head is nîminâniwan with FST analysis nîmiw+V+AI+Ind+X, then the FST lemma is nîmiw. If the head is nîmiw with the FST analysis nîmiw+V+AI+Ind+3Sg, then the FST lemma is nîmiw, which is the same as the head.
Normally the FST lemma is included as part of the analysis, and the code can retrieve the conceptual FST lemma from that analysis, so the separate fstLemma field is redundant and should not be explicitly included.
However, sometimes it is desirable to have a dictionary entry that is not analyzable but for which dynamic paradigms should be displayed. For example, in Arapaho, one entry has the non-analyzable stem níhooyóó- as a head, which should display dynamic paradigms using the FST lemma nihooyoo. Therefore, that is precisely what the fstLemma field for this entry contains.
This field is useful in that case, dynamic paradigms for non-analyzable head entries, and only in that case.
This field is only supported for languages with the MORPHODICT_ENABLE_FST_LEMMA_SUPPORT setting enabled, which is currently only Arapaho.
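To make the relationship between analysis and fstLemma concrete, here is a minimal sketch in Python. The function names fst_lemma and analysis_to_string are hypothetical and are not part of morphodict. The sketch assumes only what is described above: an analysis is a three-element array of prefix tags, FST lemma, and suffix tags, and fstLemma is consulted only when there is no analysis.

```python
def fst_lemma(entry):
    """Return the FST lemma for an entry dict, or None if none is available."""
    analysis = entry.get("analysis")
    if analysis is not None:
        # analysis is [prefix tags, FST lemma, suffix tags]
        _prefix_tags, lemma, _suffix_tags = analysis
        return lemma
    # Fall back to the explicit field, meant for unanalyzable heads only.
    return entry.get("fstLemma")


def analysis_to_string(analysis):
    """Join the three parts of an analysis into a single tag string."""
    prefix_tags, lemma, suffix_tags = analysis
    return "".join(prefix_tags) + lemma + "".join(suffix_tags)
```

For the nîminâniwan entry shown earlier, analysis_to_string gives nîmiw+V+AI+Ind+X and fst_lemma gives nîmiw; for the Arapaho stem entry, which has no analysis, fst_lemma falls back to the fstLemma value nihooyoo.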
Note that the only strictly required fields are head, slug, and
senses. If no other fields are supplied, morphodict will still work, but
many interesting and useful features of morphodict will not; you will
essentially have a static dictionary application.
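As an illustration of the slug recommendations, the following Python sketch derives slugs from heads and appends numeric @ disambiguators to homographs. It is one possible scheme, not morphodict code: the names URL_UNSAFE, strip_unsafe, and assign_slugs are invented here, and the set of characters treated as URL-unsafe is an assumption, since this document only names / as an example.

```python
from collections import Counter

# Assumption: an illustrative set of URL-unsafe characters, not morphodict's list.
URL_UNSAFE = set("/?#%&")


def strip_unsafe(head):
    """Drop URL-unsafe characters from a head, keeping diacritics."""
    return "".join(ch for ch in head if ch not in URL_UNSAFE)


def assign_slugs(heads):
    """Return one slug per head, numbering homographs @1, @2, ... in input order."""
    bases = [strip_unsafe(h) for h in heads]
    counts = Counter(bases)
    seen = Counter()
    slugs = []
    for base in bases:
        if counts[base] > 1:
            seen[base] += 1
            slugs.append(f"{base}@{seen[base]}")
        else:
            slugs.append(base)
    return slugs
```

Numbering by input order is only stable while the entries keep their relative order, so a lexicographer who wants stable links may prefer to store hand-assigned disambiguators such as @na or @ni in the source data instead.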
formOf entries
These entries add additional definitions to inflected forms of normal entries.
In the morphodict application, formOf entries are described as being a ‘form of’ the corresponding normal entry.
Some linguists feel that, lexicographically, these entries may be more appropriate as standalone normal entries, on the grounds that having a distinct definition implies being a distinct lexeme. Others argue that there is room for certain inflected forms to have their own connotations and shades of meaning, even in English, but especially in morphologically complex languages.
The fields are:
The formOf field, a string, must equal the slug of a normal entry in the same importjson file.
The head field has the same format and meaning as for normal entries.
The senses field has the same format and meaning as for normal entries.
The analysis field has the same format and meaning as for normal entries, but with the additional condition that the formOf analysis FST lemma must equal the FST lemma of the corresponding normal entry.
No other fields are valid on formOf entries.
Validation
The following importjson validation checks are intended to be implemented for morphodict:
Every entry must have at least one non-empty definition, and every definition must have at least one valid source.
If an analysis is specified, it must be one of the results returned from doing an FST lookup on the head.
If an analysis is specified, the head must be one of the results returned by running the analysis through the generator FST.
fstLemma may not be specified if there is an analysis.
The FST lemma in every formOf analysis must match the FST lemma of the corresponding normal entry.
Strings must not begin with a combining character. If a string is intended to start with a diacritic, e.g., a floating tone such as "´a", or " ̣gwà…", use a non-combining character such as ´, or if there is no non-combining equivalent such as for Combining Dot Below, put the combining character on a space, a non-breaking space, or a U+25CC ◌ Dotted Circle.
The slug must not contain certain URL-unsafe characters, e.g., /.
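Several of these checks need nothing but the importjson data itself, and could be sketched as follows. This is a hypothetical validator written for illustration, not the one in morphodict. It skips the two checks that require running an FST, treats any non-empty sources list as valid, and only knows about / as a URL-unsafe character because that is the one example given here.

```python
import unicodedata

# Assumption: only "/" is named in this document; the real list may be longer.
URL_UNSAFE = set("/")


def iter_strings(value):
    """Yield every string nested anywhere inside a JSON-like value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for v in value.values():
            yield from iter_strings(v)
    elif isinstance(value, list):
        for v in value:
            yield from iter_strings(v)


def lemma_of(entry):
    """Return the FST lemma from an entry's analysis, if it has one."""
    analysis = entry.get("analysis")
    return analysis[1] if analysis else None


def validate(entries):
    """Return a list of human-readable problems found in the entries."""
    errors = []
    normal = {e["slug"]: e for e in entries if "formOf" not in e and "slug" in e}
    for i, e in enumerate(entries):
        label = e.get("slug") or e.get("head") or f"#{i}"
        senses = e.get("senses") or []
        if not senses:
            errors.append(f"{label}: no senses")
        for sense in senses:
            if not sense.get("definition", "").strip():
                errors.append(f"{label}: empty definition")
            if not sense.get("sources"):
                errors.append(f"{label}: definition without a source")
        if e.get("analysis") and e.get("fstLemma"):
            errors.append(f"{label}: both analysis and fstLemma given")
        if "formOf" in e:
            target = normal.get(e["formOf"])
            if target is None:
                errors.append(f"{label}: formOf matches no slug")
            elif lemma_of(e) and lemma_of(target) and lemma_of(e) != lemma_of(target):
                errors.append(f"{label}: FST lemma differs from that of {e['formOf']}")
        elif any(ch in URL_UNSAFE for ch in e.get("slug", "")):
            errors.append(f"{label}: slug has URL-unsafe character")
        for text in iter_strings(e):
            # Unicode general categories Mn, Mc, and Me are the combining marks.
            if text and unicodedata.category(text[0]).startswith("M"):
                errors.append(f"{label}: string starts with a combining character")
    return errors
```

Returning a list of messages rather than raising on the first problem lets a lexicographer see every issue in a draft dictionary in a single pass.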
Caveats
Known issues with the importjson format:
In many dictionaries, the order of definitions is very important, with the most common definitions being listed first. There is currently no code in morphodict to explicitly store or preserve the order of definitions. The dictionary is currently largely working by coincidence in that the import process and database tend to show the same results to the user as what was in the importjson file.
This isn’t so much an issue with the input format, which does have an explicit definition order, as a warning that this order may not be preserved.
Where do files go?
Each full dictionary for language pair sssttt is intended to be
placed at
src/${sssttt}/resources/dictionaries/${sssttt}_dictionary.importjson
There is a .gitignore rule to prevent accidentally committing them.
Building test dictionary data
Test dictionaries are created by taking subsets of the full dictionaries,
and storing them beside the full dictionaries as
${sssttt}_test_db.importjson.
These files are checked in so that people can do development and testing without having any access to the full dictionary files, which have restricted distribution.
To update them, you’ll need a copy of the full dictionary file.
Edit src/${sssttt}/resources/dictionary/test_db_words.txt.
Run ./${sssttt}-manage buildtestimportjson to extract the words mentioned in test_db_words.txt, from ${sssttt}_dictionary.importjson into ${sssttt}_test_db.importjson.
Commit your changes and send a PR.
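Conceptually, the extraction step filters the full entry list down to the listed words. The Python sketch below shows one way such a filter could work; it is not the implementation of buildtestimportjson, and the matching rules (match on head or slug, keep a lemma together with all of its formOf entries) are assumptions made for the example, as is the function name extract_subset.

```python
def extract_subset(entries, words):
    """Keep entries whose head or slug is listed, plus their lemma/formOf relatives."""
    wanted = set(words)

    # Slugs of normal entries that are requested directly.
    kept_slugs = {
        e["slug"]
        for e in entries
        if "formOf" not in e and (e["head"] in wanted or e["slug"] in wanted)
    }
    # A requested formOf entry pulls in its lemma so the reference stays valid.
    for e in entries:
        if "formOf" in e and e["head"] in wanted:
            kept_slugs.add(e["formOf"])

    return [
        e
        for e in entries
        if e.get("slug") in kept_slugs or e.get("formOf") in kept_slugs
    ]
```

Keeping a lemma and its formOf entries together means the resulting test file still satisfies the rule that every formOf value equals the slug of a normal entry in the same file.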
Exception: the current crkeng test database omits many unused keys,
e.g., wordclass_emoji, that currently exist in the production
crkeng_dictionary.importjson file.
Current dictionary data
In theory, linguists will provide comprehensive and correct dictionary data in the morphodict-specific importjson format.
In practice, at this time, full dictionaries for each language arise as follows:
For Plains Cree, there is crk/dicts/crkeng_dictionary.importjson checked in to the secret git repository at altlab.dev:/data/altlab.git. This is what’s used in production. It was created by importing the old crkeng.xml file into an older version of the software that did a lot of paradigm and analysis inference during import, and then the database contents were exported in the new importjson format.
For Woods Cree, the UAlbertaALTLab/munge:/cwdeng/cwdize script transliterates the production crkeng_dictionary.importjson file, using the database.ndjson file from the ALTLab repo to get the proto-Cree forms.
For Arapaho, the private arp-db repo has arapaho_lexicon.json, and the UAlbertaALTLab/munge:/arpeng/toimportjson.ts script transforms that to importjson.
For Tsuut’ina, there’s a deprecated spreadsheet on Google Drive. Get the link from srs/README.md in the ALTLab repo, as I’m not sure it’s intended to be public.
Download the single-tab spreadsheet as a .tsv file, and UAlbertaALTLab/munge:/srseng/toimportjson.ts will transform that to importjson.
None of these are publicly available as the creators of the source content have not given permission to make them publicly available in that form. For Arapaho we believe we could make the data public, but have not yet had sufficiently official confirmation of that.
Building
To install the prerequisites of the munge scripts:
git clone git@github.com:UAlbertaALTLab/munge.git
cd munge
npm install
Then, running them takes a little bit of fiddling because they are written in TypeScript and need to be transpiled to JavaScript. To do that on the fly:
node -r sucrase/register/ts cwdeng/cwdize.ts --help
In several of the directories there is an executable run.js script to do
that for you, so it could be as simple as ./run.js --help.
Cree linguistInfo
For the Plains Cree dictionary, the following linguistInfo fields are
used to display linguistic info in search results, to provide semantic class information, and for showing emoji:
inflectional_category, String: The inflectional category for an entry, with hyphen, e.g., NI-1. (CW’s \ps)
pos, String: The part of speech for this entry (N/V/PRON). If we were naming this today following our glossary, we would call it the general word class.
rw_domains, list of String: The RapidWords semantic classification domain names for this entry, in the canonical form defined in both rapidwords.net and semdom.org, e.g., [ "Sleep" ]
rw_indices, dictionary mapping String to a list of String: For each of the sources in the entry (using the same short abbreviations as in source), we provide the list of indices for the RapidWords semantic classification domains for the entry, in the canonical form defined in both rapidwords.net and semdom.org, e.g., { "CW": [ "5.7.1" ] }
stem, String: The FST stem for this entry.
For Plains Cree specifically, there are two variants of linguistic stems in the ALTLab crk-db. For both, a preceding hyphen (for dependent nouns, e.g. -ohkom-) and/or following hyphen (for all stems, e.g. nîmi-) indicate that they can take additional prefixes/suffixes:
the minimal CW stem from the \stm field in the CW toolbox source, which N.B. is lacking from MD and AECD. It should be present there for all words, but blank for non-independent morphemes, and might be a list when the head is a phrase. This minimal stem may lack lexicalized reduplicative elements and/or preverbs/prenouns, and thus may not have a one-to-one mapping to possible lemmas (e.g. api- as the minimal CW stem of the lemma ay-apiw).
the full FST stem according to the fst.stem field in the ALTLab crk-db. This includes all the reduplicative elements as well as preverbs/prenouns which have become lexicalized in a lemma, and thus has a one-to-one mapping with the lemma. This is created in the ALTLab crk-db based on the minimal stem, e.g. ay-api- as the full FST stem for the lemma ay-apiw. The FST stem, when supplemented with special morphophonological symbols, is used in the lexc source code for crk stems in the format: <FST lemma>:<FST stem>, e.g., acitakotêw:acitakot3 VTAt ;
itwêwina currently has the FST stem in the linguistInfo.stem field, and does not include a separate CW stem in the importjson. If display of the minimal CW stem were some day added to morphodict, that would of course require the dictionary data to include that data at that time.
wn_domains, a list of String: The WordNet semantic classifications for this entry, using the same format as in the Altlab wordnet server, e.g., [ "(v) sleep#1", "(adv) together#4" ].
wordclass, String: The word class for this entry (VTA/VAI/etc.). At one time our glossary called this a specific word class.
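For scripts that generate or check Plains Cree importjson, the field list above can be written down as a Python type. This is a sketch of the shape only, with the hypothetical names CreeLinguistInfo and unknown_keys; treating every field as optional (total=False) is an assumption, as this document does not say which of them are always present.

```python
from typing import Dict, List, TypedDict


class CreeLinguistInfo(TypedDict, total=False):
    """Shape of linguistInfo for Plains Cree entries, per the list above."""

    inflectional_category: str        # e.g. "NI-1"
    pos: str                          # N / V / PRON
    rw_domains: List[str]             # RapidWords domain names
    rw_indices: Dict[str, List[str]]  # source abbreviation -> RapidWords indices
    stem: str                         # the FST stem
    wn_domains: List[str]             # WordNet classifications
    wordclass: str                    # VTA / VAI / etc.


def unknown_keys(info):
    """Return the keys of a linguistInfo dict that are not in the list above."""
    return sorted(set(info) - set(CreeLinguistInfo.__annotations__))
```

A check like unknown_keys can flag keys that are not in this list, such as wordclass_emoji, before they are added to linguistInfo without a matching frontend use.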