How to Write a Spelling Corrector in Erlang (ESpell)

Erlang is a beautiful programming language from Ericsson which i first came across while cutomizing authentication flow of ejabberd about 2 years back. Ever since then I have been using erlang for all my application backend needs including custom http server, custom bosh conn. manager, xmpp components and clients, … Recently i have even started churning my application html pages via erlang using erlydtl (an Erlang implementation of the Django Template Language).

Years ago, i gave a successful shot at implementing Peter Norvig’s Spell Corrector in PHP. Last weekend i attempted the same “Spell Corrector” algorithm in about 45 lines of Erlang code.

ESpell:

Complete code file with comments and explaination can be found here:
https://github.com/abhinavsingh/espell

-module(espell).
-define(alphabet, "abcdefghijklmnopqrstuvwxyz").
-export([start/0, correct/1]).

%%
%% API Functions
%%

%% @doc start spell checker
start() -> train(words()).

%% @doc returns most probable correct candidate with score
correct(Word) ->
	lists:foldl(
		fun(Candidate, {Correction, Score}) -> 
			case ets:lookup(?MODULE, list_to_binary(Candidate)) of
				[{_, Counter}] when Counter > Score -> {Candidate, Counter};
				_ -> {Correction, Score}
			end
		end, {Word, 0}, get_candidates(Word)).

%%
%% Local Functions
%%

words() ->
	{ok, Bin} = file:read_file("../priv/big.txt"),
	{ok, Words} = regexp:split(binary_to_list(Bin), "[^a-zA-Z]"),
	lists:map(fun(X) -> string:to_lower(X) end, Words).

train(Features) ->
	io:fwrite("training initial word list...~n"),
	ets:new(?MODULE, [set, named_table]),
	lists:foreach(fun(X) ->
		case ets:insert_new(?MODULE, {list_to_binary(X), 1}) of
			false -> ets:update_counter(?MODULE, list_to_binary(X), 1);
			true -> true
		end
	end, Features),
	io:fwrite("training complete...~n"),
	ok.

edits1(Word) ->
	Splits = lists:foldl(fun(I, Acc) -> Acc ++ [{string:substr(Word, 1, I), string:substr(Word, I+1, string:len(Word)-I)}] end, [{"", Word}], lists:seq(1, string:len(Word))),
	Deletes = [A ++ string:substr(B, 2) || {A,B} <- Splits, B =/= []],
	Transposes = [A ++ string:substr(B, 2, 1) ++ string:substr(B, 1, 1) ++ string:substr(B, 3) || {A,B} <- Splits, string:len(B) > 1],
	Replaces = [A ++ binary_to_list(<>) ++ string:substr(B, 2) || {A,B} <- Splits, B =/= [], C <- ?alphabet],
	Inserts = [A ++ binary_to_list(<>) ++ B || {A,B} <- Splits, C <- ?alphabet],
	lists:usort(Deletes ++ Transposes ++ Replaces ++ Inserts).

%%edits2(Word) -> lists:usort([E2 || E1 <- edits1(Word), E2 <- edits1(E1)]).
known_edits2(Word) -> lists:usort([E2 || E1 <- edits1(Word), E2 <- edits1(E1), ets:member(?MODULE, list_to_binary(E2))]).
known(Words) -> lists:usort([Word || Word <- Words, ets:member(?MODULE, list_to_binary(Word))]).

get_candidates(Word) ->
	C1 = known([Word]),
	if 
		length(C1) > 0 -> C1;
		true -> C2 = known(edits1(Word)),
			if
				length(C2) > 0 -> C2;
				true ->	C3 = known_edits2(Word),
					if length(C3) > 0 -> C3; 
					true -> [Word] end
			end
	end.

Try It Out:
espell provides 2 simple function for all it’s working:

  • start() : start espell which initiates reading initial data and training phase
  • correct(Word) : this accepts 1 parameter, which is the word you want to correct. It returns a 2-tuple, where 1st element is the correct word and 2nd element is a score (which right now simply means number of times correct word was seen in training data set)
$ cd espell
$ erlc -o ebin/ src/espell.erl
$ erl -pa ebin/
Erlang R14B03 (erts-5.8.4) [source] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.8.4  (abort with ^G)
1> espell:start().
training initial word list...
training complete...
ok
2> espell:correct("speling").
{"spelling",4}
5> espell:correct("somthing").
{"something",683}

This code makes extensive use of list comprehensions in erlang, which is hugely responsible for cutting down espell code to just 45 lines of erlang.

  • hey good to see you back on blogging!
    and on erlang! that’s wonderful.
    will look fwd for more erlang.