class Text::Hyphen
An object that knows how to perform hyphenation based on the TeX hyphenation algorithm with pattern files. Each object is constructed with a specific language’s hyphenation patterns.
Constants
- DEBUG
- DEFAULT_MIN_LEFT
- DEFAULT_MIN_RIGHT
- VERSION
Attributes
Returns the language’s ISO 639 ID, e.g., “en_us” or “pt”.
The name of the language to be used in hyphenating words. This will be a two or three character ISO 639 code, with the two character form being the canonical resource name. This will load the language hyphenation definitions from text/hyphen/language/<code> as a Ruby class. The resource ‘text/hyphen/language/en_us’ defines the language class Text::Hyphen::Language::EN_US. It also defines the secondary forms Text::Hyphen::Language::EN and Text::Hyphen::Language::ENG_US.
Minimal transformations will be performed on the language code provided, such that any dashes are converted to underscores (e.g., ‘en-us’ becomes ‘en_us’) and all characters are regularised. Resource names will be downcased and class names will be converted to uppercase (e.g., ‘Pt’ for the Portuguese language becomes ‘pt’ and ‘PT’, respectively).
The language may also be specified as an instance of Text::Hyphen::Language
.
No fewer than this number of letters will show up to the left of the hyphen. The initial value for this will be specified by the language; setting this value will override the language’s defaults.
No fewer than this number of letters will show up to the right of the hyphen. This overrides the default specified in the language.
Public Class Methods
Creates a hyphenation object with the options requested. The options available are:
- language
-
The language to perform hyphenation with. See
language
andiso_language
. - left
-
The minimum number of characters to the left of a hyphenation point. See
left
. - right
-
The minimum number of characters to the right of a hyphenation point. See
right
.
The options can be provided either as hashed parameters or set as methods in an initialization block. The following initializations are all equivalent:
hyp = Text::Hyphenate.new(:language => 'en_us') hyp = Text::Hyphenate.new(language: 'en_us') # under Ruby 1.9 hyp = Text::Hyphenate.new { |h| h.language = 'en_us' }
# File lib/text/hyphen.rb, line 75 def initialize(options = {}) # :yields self: @iso_language = options[:language] @left = options[:left] @right = options[:right] @language = nil @cache = {} @vcache = {} @hyphen = {} @begin_hyphen = {} @end_hyphen = {} @both_hyphen = {} @exception = {} @first_load = true yield self if block_given? @first_load = false load_language @left ||= DEFAULT_MIN_LEFT @right ||= DEFAULT_MIN_RIGHT end
Public Instance Methods
Clears the per-instance hyphenation and visualization caches.
# File lib/text/hyphen.rb, line 173 def clear_cache! @cache.clear @vcache.clear end
Returns an array of character positions where a word can be hyphenated.
hyp.hyphenate('representation') #=> [3, 5, 8 10]
Because hyphenation can be expensive, if the word has been hyphenated previously, it will be returned from a per-instance cache.
# File lib/text/hyphen.rb, line 106 def hyphenate(word) word = word.downcase $stderr.puts "Hyphenating #{word}" if DEBUG return @cache[word] if @cache.has_key?(word) res = @language.exceptions[word] return @cache[word] = make_result_list(res) if res letters = word.scan(@language.scan_re) $stderr.puts letters.inspect if DEBUG word_size = letters.size result = [0] * (word_size + 1) right_stop = word_size - @right updater = Proc.new do |hash, str, pos| if hash.has_key?(str) $stderr.print "#{pos}: #{str}: #{hash[str]}" if DEBUG hash[str].scan(@language.scan_re).each_with_index do |cc, ii| cc = cc.to_i result[ii + pos] = cc if cc > result[ii + pos] end $stderr.print ": #{result.inspect}\n" if DEBUG end end # Walk the word (0..right_stop).each do |pos| rest_length = word_size - pos (1..rest_length).each do |length| substr = letters[pos, length].join('') updater[@language.hyphen, substr, pos] updater[@language.start, substr, pos] if pos.zero? updater[@language.stop, substr, pos] if (length == rest_length) end end updater[@language.both, word, 0] if @language.both[word] (0..@left).each { |i| result[i] = 0 } ((-1 - @right)..(-1)).each { |i| result[i] = 0 } @cache[word] = make_result_list(result) end
This function will hyphenate a word so that the first point is at most
NOTE: if hyphen is set to a string, it will still be counted as one character (since it represents a hyphen)
size
characters.
# File lib/text/hyphen.rb, line 184 def hyphenate_to(word, size, hyphen = '-') point = hyphenate(word).delete_if { |e| e >= size }.max if point.nil? [nil, word] else [word[0 ... point] + hyphen, word[point .. -1]] end end
Returns a string describing the structure of the patterns for the language of this hyphenation object.
# File lib/text/hyphen.rb, line 195 def stats _b = @language.both.size _s = @language.start.size _e = @language.stop.size _h = @language.hyphen.size _x = @language.exceptions.size _T = _b + _s + _e + _h + _x s = <<-EOS The language '%s' contains %d total hyphenation patterns. % 6d patterns are word start patterns. % 6d patterns are word stop patterns. % 6d patterns are word start/stop patterns. % 6d patterns are normal patterns. % 6d patterns are exceptions. EOS s % [ @iso_language, _T, _s, _e, _b, _h, _x ] end
Returns a visualization of the hyphenation points.
hyp.visualize('representation') #=> rep-re-sen-ta-tion
Any string can be set instead of the default hyphen:
hyp.visualize('example', '­') #=> exam­ple
Because hyphenation can be expensive, if the word has been visualised previously, it will be returned from a per-instance cache.
# File lib/text/hyphen.rb, line 159 def visualise(word, hyphen = '-') return @vcache[word] if @vcache.has_key?(word) w = word.dup s = hyphen.size hyphenate(w).each_with_index do |pos, n| # Insert the hyphen string at the ported position plus the offset of # the last hyphen string inserted. w[pos.to_i + (n * s), 0] = hyphen unless pos.zero? end @vcache[word] = w end
Private Instance Methods
# File lib/text/hyphen.rb, line 235 def load_language return if @first_load @iso_language ||= "en_us" unless @language require "text/hyphen/language/#{@iso_language}" @language = Text::Hyphen::Language.const_get(@iso_language.upcase) @iso_language = @language.isocode if @language.isocode end @left ||= @language.left @right ||= @language.right @iso_language end
# File lib/text/hyphen.rb, line 228 def make_result_list(res) r = [] res.each_with_index { |c, i| r << i * (c.to_i % 2) } r.reject { |i| i.to_i == 0 } end
# File lib/text/hyphen.rb, line 216 def updateresult(hash, str, pos) if hash.has_key?(str) STDERR.print "#{pos}: #{str}: #{hash[str]}" if DEBUG hash[str].scan(@language.scan_re).each_with_index do |c, i| c = c.to_i @result[i + pos] = c if c > @result[i + pos] end STDERR.puts ": #{@result}" if DEBUG end end