Note: Aspell International Support is about to undergo a major rewrite. Please see http://metalab.unc.edu/kevina/aspell/international/ for more information. The information presented here will be outdated very soon.
Even though Aspell is designed around the English language, it will do OK with other languages, provided that the language doesn't have an extremely large dictionary (say, over a megabyte or two in size) or a lot of affixation (to the point where affix compression would shrink the size by over 50%). If the language has a large dictionary or a lot of affixation, Aspell will work, but it will take up a lot of space due to the way Aspell indexes the words (see 6) and the fact that Aspell currently lacks any sort of affix compression (see B.7.1).
Support for other languages can be added either at run time through a language data file or at compile time.
Languages can be added at will through the use of a language data file. The file must be in the same directory as the word list(s), and it must be named <language>.lang, where <language> is the name of the language you are adding support for.
The data file consists of three blocks of information enclosed in braces. Any information outside of the braces is ignored. The white space before and after the braces is mandatory.
The first block of information contains the upper to lower case mapping. It consists of lower/upper case pairs of letters with white space between them. For example, here is the case mapping for English:
{ aA bB cC dD eE fF gG hH iI jJ kK lL mM nN oO pP qQ rR sS tT uU vV wW xX yY zZ }
If a character is the same in both upper and lower case, then repeat it twice, such as "kk". Failure to do so will result in an error. Also, as noted above, the white space before and after the braces ({ }) is mandatory.
The second block of information contains a list of the vowel or vowel like characters in lower case. For example the second block for English would be:
{ a e i o u y }
The last block of information contains a list of other characters which are not part of the alphabet but can nevertheless appear within a valid word. The final block for English would be:
{ ' }
For your reference, here is what the complete english.lang file looks like:
Language File for english
Case Block { aA bB cC dD eE fF gG hH iI jJ kK lL mM nN oO pP qQ rR sS tT uU vV wW xX yY zZ }
Vowel Block { a e i o u y }
Other Characters Block { ' }
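The three-block layout described above can be read with a very small tokenizer. The following is only a sketch under stated assumptions, not Aspell's actual parser: it assumes every token is delimited by white space (which would explain why the white space around the braces is mandatory), and the names LangData and parse_lang_data are hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for the three blocks; not an Aspell type.
struct LangData {
    std::vector<std::string> case_pairs;  // e.g. "aA", "bB", or "kk"
    std::vector<std::string> vowels;      // e.g. "a", "e"
    std::vector<std::string> others;      // e.g. "'"
};

// Reads whitespace-separated tokens; anything outside "{ ... }" is skipped.
LangData parse_lang_data(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::vector<std::string>> blocks;
    std::string tok;
    bool inside = false;
    while (in >> tok) {
        if (tok == "{") {
            if (inside) throw std::runtime_error("nested '{'");
            inside = true;
            blocks.emplace_back();
        } else if (tok == "}") {
            if (!inside) throw std::runtime_error("unmatched '}'");
            inside = false;
        } else if (inside) {
            assert(!blocks.empty());  // a block is opened before we get here
            blocks.back().push_back(tok);
        }
    }
    if (inside || blocks.size() != 3)
        throw std::runtime_error("expected exactly three closed blocks");
    // Each case entry must be exactly one lower/upper pair.
    for (const std::string& p : blocks[0])
        if (p.size() != 2)
            throw std::runtime_error("bad case pair: " + p);
    return LangData{blocks[0], blocks[1], blocks[2]};
}
```

With this sketch, a brace that touches other text (for example "}If") is not recognized as a closing brace, so the input is rejected instead of being silently misread.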
Once you have created the data file, you need to pass the dictionary through "aspell master" to properly prepare it using the new language. Also make sure the word list and the language data file are in the same directory.
Once you have used the new language for a while please consider sending me a copy of the data file so that I can include it in future versions.
More complete support for a language can be added by writing some code and recompiling the source. In order to do this you should have the latest versions of automake, autoconf, and libtool installed, as the Makefile will need to be recreated.
The easiest way to get started is to write the language data file first and use the aspell utility to create most of the code for you. The usage is:
aspell lang [<path>]<lang>
Where <path> is the optional fully qualified directory name of the location of the language data file and <lang> is the name of the language. There should be no space between <path> and <lang>.
This will create two files, asl_<lang>.hh and asl_<lang>.cc, containing all the code you need to compile in support for the language. However, in order to get Aspell to recognize the new language you need to modify the file language.cc in two places. You need to include the asl_<lang>.hh file and you need to add the language to the lookup static variable. The line to add should look like this:
lookup_pair("<lang>", new_SC_<Lang>)
Where <Lang> is <lang> with the first letter capitalized. For example, to add support for the French language you would say:
lookup_pair("french", new_SC_French)
This line can go anywhere in the table; however, I recommend that you add it after the last entry. Just be sure the list still has all the necessary commas.
Finally, you need to add the file asl_<lang>.cc to the end of the libspell_la_SOURCES variable in Makefile.am and then type make. All the necessary files will be recreated automatically, provided that you have the proper tools installed.
Once you have successfully used the compiled-in language, you can start experimenting with fine tuning it by overriding virtual methods in the SC_Language class.
The SC_Language class is the base class for language support; all language classes must be derived from it.
// other irrelevant non virtual methods
};
All of the protected data members must be given a value by the derived class, as the public methods rely on them. The "aspell lang" utility will take care of this for you, so in most cases you don't need to worry about them.
This data member needs to point to a null terminated string containing the name of the current language.
This data member needs to point to a 256 character long character array which maps the upper case characters to lower case. A static_cast<unsigned char> is performed on the character before it is looked up, so that a signed value of -1 would become 255. If the character c is an upper case character, then to_lower_[static_cast<unsigned char>(c)] needs to contain c in lower case. If c is not in upper case, then it needs to contain c.
The same as to_lower_, but it maps lower case characters to upper case.
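The two case tables can be filled directly from the lower/upper pairs of the language data file. The sketch below is an illustration only; CaseTables, build_case_tables, and lower_of are hypothetical names and not the code that "aspell lang" generates. It starts from the identity mapping so that any character without a pair maps to itself, and it always indexes through unsigned char.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the two 256-entry tables; not the Aspell class.
struct CaseTables {
    char to_lower[256];
    char to_upper[256];
};

// pairs holds lower/upper pairs such as "aA"; a caseless character is "kk".
CaseTables build_case_tables(const std::vector<std::string>& pairs) {
    CaseTables t;
    // Identity mapping first, so non-letters map to themselves.
    for (int i = 0; i < 256; ++i) {
        t.to_lower[i] = static_cast<char>(i);
        t.to_upper[i] = static_cast<char>(i);
    }
    for (const std::string& p : pairs) {
        assert(p.size() == 2);  // exactly one lower and one upper character
        const unsigned char lower = static_cast<unsigned char>(p[0]);
        const unsigned char upper = static_cast<unsigned char>(p[1]);
        t.to_lower[upper] = p[0];
        t.to_upper[lower] = p[1];
    }
    return t;
}

// Index through unsigned char so negative char values land in 128..255.
char lower_of(const CaseTables& t, char c) {
    return t.to_lower[static_cast<unsigned char>(c)];
}
```

Without the cast, a char with the high bit set would be a negative index on platforms where char is signed, which is why every lookup goes through unsigned char.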
Similar to to_lower_ and to_upper_, except that is_alpha_[static_cast<unsigned char>(c)] needs to be false (0) if c is a non-word character and true (anything but 0) otherwise.
In addition, if the to_soundslike method is not overridden, the entry for c needs to be SC_Language::consonant if c is a consonant and SC_Language::vowel if c is a vowel. If the trim_n_try method is not overridden, the entry for c needs to be SC_Language::special if c is a non-alpha character that can appear as part of a word, such as the apostrophe (') in English.
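The classification table can be sketched in the same way as the case tables. The numeric values of SC_Language::consonant, SC_Language::vowel, and SC_Language::special are not given in this text, so the enum below uses made-up values; AlphaTable and build_alpha_table are likewise hypothetical names. The only property taken from the text is that non-word characters are 0 and everything else is non-zero.

```cpp
#include <cassert>
#include <string>

// Hypothetical values; the real SC_Language constants may differ.
enum CharClass : char { non_word = 0, consonant = 1, vowel = 2, special = 3 };

// Hypothetical stand-in for the 256-entry is_alpha_ table.
struct AlphaTable {
    char is_alpha[256];
};

// letters:  every letter of the alphabet, in both cases
// vowels:   the vowels, in both cases (the data file lists lower case only,
//           so the caller is assumed to add the upper case forms)
// specials: non-alpha characters allowed inside a word, such as "'"
AlphaTable build_alpha_table(const std::string& letters,
                             const std::string& vowels,
                             const std::string& specials) {
    AlphaTable t;
    for (int i = 0; i < 256; ++i) t.is_alpha[i] = non_word;
    // Every letter starts out as a consonant ...
    for (char c : letters)
        t.is_alpha[static_cast<unsigned char>(c)] = consonant;
    // ... and the listed vowels are then reclassified.
    for (char c : vowels) {
        const unsigned char i = static_cast<unsigned char>(c);
        assert(t.is_alpha[i] != non_word);  // a vowel must also be a letter
        t.is_alpha[i] = vowel;
    }
    for (char c : specials)
        t.is_alpha[static_cast<unsigned char>(c)] = special;
    return t;
}
```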
Needs to contain a null terminated array of characters which contains all of the characters that can appear in a to_soundslike string. If the to_soundslike method is not overridden, this will be all the lower case consonants.
The destructor must be defined if your class uses any dynamically allocated memory, as the SC_Language class destructor does not delete anything.
These methods only have to be overridden if you are unhappy with the job they do. const_string is a very limited version of the string class. It has an iterator and can be used like a random access container; however, it doesn't have any of the fancy string methods such as find and substr.
This method needs to return a string which represents what the word roughly sounds like.
This method needs to return a string which represents the phoneme for the word.
Needs to return true if the to_phoneme method is overridden.
This method needs to study the string and return an integer which represents the case pattern (such as all uppercase, first letter uppercase, etc.).
This method needs to fix the case of the word so that it matches the case pattern given by pattern, and return the new word.
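The two case methods work as a pair: one extracts a pattern from a word, the other applies a pattern to a word. The sketch below illustrates the idea under several assumptions: the function names, the CasePattern enum and its three values are hypothetical, std::string is used in place of const_string, and the standard <cctype> functions stand in for the to_lower_ and to_upper_ tables.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical pattern codes; the text only says an integer is returned.
enum CasePattern { all_lower, first_upper, all_upper };

// Classify a word by the case of its first letter and of the remainder.
CasePattern case_pattern(const std::string& word) {
    if (word.empty() || !std::isupper(static_cast<unsigned char>(word[0])))
        return all_lower;
    bool rest_has_lower = false;
    for (std::size_t i = 1; i < word.size(); ++i)
        if (std::islower(static_cast<unsigned char>(word[i])))
            rest_has_lower = true;
    // A single capital such as "I" is treated as first_upper.
    if (word.size() > 1 && !rest_has_lower) return all_upper;
    return first_upper;
}

// Apply a pattern to a word that is assumed to be stored in lower case.
std::string fix_case(const std::string& word, CasePattern pattern) {
    std::string out = word;
    if (pattern == all_upper) {
        for (char& c : out)
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    } else if (pattern == first_upper && !out.empty()) {
        out[0] = static_cast<char>(std::toupper(static_cast<unsigned char>(out[0])));
    }
    return out;
}
```

A typical use would be to record the pattern of a misspelled word and apply it to each lower case suggestion, so that a suggestion for "Teh" comes back as "The".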
This method should try to trim special characters (such as the apostrophes in English) from the word and then see if it is a valid word. If it can find a valid word by trimming, it should return true. Otherwise it should return false.
To avoid infinite recursion, this method should not call aspell::check, as aspell::check calls this method. Use aspell::check_notrim instead (aspell::check_raw should not be used, as it does not try to change the case of the word, thus 'Do' would come back false).
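The trimming logic can be sketched as follows. This is not the default Aspell implementation: the real signatures of trim_n_try and aspell::check_notrim are not shown in this text, so the sketch takes the set of special characters and the checker as parameters, and uses std::string in place of const_string. The callback named check_notrim stands in for aspell::check_notrim, which keeps the function from recursing back into itself.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Returns true only if trimming special characters yields a valid word.
// check_notrim is a stand-in for aspell::check_notrim (signature assumed).
bool trim_n_try(const std::string& word,
                const std::string& specials,
                const std::function<bool(const std::string&)>& check_notrim) {
    auto is_special = [&](char c) {
        return specials.find(c) != std::string::npos;
    };
    std::size_t begin = 0, end = word.size();
    while (begin < end && is_special(word[begin])) ++begin;
    while (end > begin && is_special(word[end - 1])) --end;

    if (begin == end) return false;                      // only special characters
    if (begin == 0 && end == word.size()) return false;  // nothing to trim

    // Try with both ends trimmed first.
    if (check_notrim(word.substr(begin, end - begin))) return true;

    // If both ends were trimmed, also try each side on its own.
    if (begin > 0 && end < word.size()) {
        if (check_notrim(word.substr(begin))) return true;   // front only
        if (check_notrim(word.substr(0, end))) return true;  // back only
    }
    return false;
}
```

The untrimmed word is never passed to the checker, since the caller is assumed to have tested it already.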
Both the copy constructor and the assignment operator are private so that you don't have to worry about copies being made.