How to Write a Spelling Corrector in Erlang (ESpell)

Erlang is a beautiful programming language from Ericsson which i first came across while cutomizing authentication flow of ejabberd about 2 years back. Ever since then I have been using erlang for all my application backend needs including custom http server, custom bosh conn. manager, xmpp components and clients, … Recently i have even started churning my application html pages via erlang using erlydtl (an Erlang implementation of the Django Template Language).

Years ago, i gave a successful shot at implementing Peter Norvig’s Spell Corrector in PHP. Last weekend i attempted the same “Spell Corrector” algorithm in about 45 lines of Erlang code.

ESpell:

Complete code file with comments and explaination can be found here:
https://github.com/abhinavsingh/espell

-module(espell).
-define(alphabet, "abcdefghijklmnopqrstuvwxyz").
-export([start/0, correct/1]).

%%
%% API Functions
%%

%% @doc start spell checker
start() -> train(words()).

%% @doc returns most probable correct candidate with score
correct(Word) ->
	lists:foldl(
		fun(Candidate, {Correction, Score}) -> 
			case ets:lookup(?MODULE, list_to_binary(Candidate)) of
				[{_, Counter}] when Counter > Score -> {Candidate, Counter};
				_ -> {Correction, Score}
			end
		end, {Word, 0}, get_candidates(Word)).

%%
%% Local Functions
%%

words() ->
	{ok, Bin} = file:read_file("../priv/big.txt"),
	{ok, Words} = regexp:split(binary_to_list(Bin), "[^a-zA-Z]"),
	lists:map(fun(X) -> string:to_lower(X) end, Words).

train(Features) ->
	io:fwrite("training initial word list...~n"),
	ets:new(?MODULE, [set, named_table]),
	lists:foreach(fun(X) ->
		case ets:insert_new(?MODULE, {list_to_binary(X), 1}) of
			false -> ets:update_counter(?MODULE, list_to_binary(X), 1);
			true -> true
		end
	end, Features),
	io:fwrite("training complete...~n"),
	ok.

edits1(Word) ->
	Splits = lists:foldl(fun(I, Acc) -> Acc ++ [{string:substr(Word, 1, I), string:substr(Word, I+1, string:len(Word)-I)}] end, [{"", Word}], lists:seq(1, string:len(Word))),
	Deletes = [A ++ string:substr(B, 2) || {A,B} <- Splits, B =/= []],
	Transposes = [A ++ string:substr(B, 2, 1) ++ string:substr(B, 1, 1) ++ string:substr(B, 3) || {A,B} <- Splits, string:len(B) > 1],
	Replaces = [A ++ binary_to_list(<>) ++ string:substr(B, 2) || {A,B} <- Splits, B =/= [], C <- ?alphabet],
	Inserts = [A ++ binary_to_list(<>) ++ B || {A,B} <- Splits, C <- ?alphabet],
	lists:usort(Deletes ++ Transposes ++ Replaces ++ Inserts).

%%edits2(Word) -> lists:usort([E2 || E1 <- edits1(Word), E2 <- edits1(E1)]).
known_edits2(Word) -> lists:usort([E2 || E1 <- edits1(Word), E2 <- edits1(E1), ets:member(?MODULE, list_to_binary(E2))]).
known(Words) -> lists:usort([Word || Word <- Words, ets:member(?MODULE, list_to_binary(Word))]).

get_candidates(Word) ->
	C1 = known([Word]),
	if 
		length(C1) > 0 -> C1;
		true -> C2 = known(edits1(Word)),
			if
				length(C2) > 0 -> C2;
				true ->	C3 = known_edits2(Word),
					if length(C3) > 0 -> C3; 
					true -> [Word] end
			end
	end.

Try It Out:
espell provides 2 simple function for all it’s working:

  • start() : start espell which initiates reading initial data and training phase
  • correct(Word) : this accepts 1 parameter, which is the word you want to correct. It returns a 2-tuple, where 1st element is the correct word and 2nd element is a score (which right now simply means number of times correct word was seen in training data set)
$ cd espell
$ erlc -o ebin/ src/espell.erl
$ erl -pa ebin/
Erlang R14B03 (erts-5.8.4) [source] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.8.4  (abort with ^G)
1> espell:start().
training initial word list...
training complete...
ok
2> espell:correct("speling").
{"spelling",4}
5> espell:correct("somthing").
{"something",683}

This code makes extensive use of list comprehensions in erlang, which is hugely responsible for cutting down espell code to just 45 lines of erlang.

One thought on “How to Write a Spelling Corrector in Erlang (ESpell)”

Comments are closed.