A novel scheme is introduced to capture the spatial correlations of
consecutive amino acids in naturally occurring proteins. This
knowledge-based strategy automatically and optimally subdivides
protein fragments into classes of similarity. The goal is to provide the
minimal set of protein oligomers (termed "oligons" for brevity) that is
able to represent any other fragment. Unlike previous studies, which
classified recurrent local motifs, our aim is to provide simplified
protein representations optimised for use in automated folding and/or
design attempts. In such contexts, it is
paramount to limit the number of degrees of freedom per amino acid
without losing accuracy in the structural representation. The
proposed method finds, by construction, the optimal compromise between
these two requirements. Several possible oligon lengths are considered.
It is shown that meaningful classifications cannot be obtained for
lengths greater than six or smaller than four. The contexts in which
oligons of length five or six are recommended are discussed. With only a few dozen
oligons of such length, virtually any protein can be reproduced within
typical experimental uncertainties. Structural data for the oligons are
made publicly available.
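
As an illustration of what it means for an oligon to "represent" a fragment, the following minimal sketch assigns a fragment to the closest entry of a small library. It rests on assumptions not stated above: fragments are taken to be lists of C-alpha coordinates, similarity is measured by a distance-matrix RMSD (chosen here only because it needs no superposition step), and the names `drmsd`, `nearest_oligon`, and the toy coordinates are hypothetical. It is not the classification procedure of this work.

```python
from itertools import combinations
from math import dist, sqrt


def drmsd(a, b):
    """Distance-matrix RMSD between two equal-length lists of 3D points.

    Assumed similarity measure: it compares internal pairwise distances,
    so it is invariant under rotation and translation of either fragment.
    """
    if len(a) != len(b):
        raise ValueError("fragments must have the same length")
    if len(a) < 2:
        raise ValueError("fragments need at least two points")
    pairs = list(combinations(range(len(a)), 2))
    sq = sum((dist(a[i], a[j]) - dist(b[i], b[j])) ** 2 for i, j in pairs)
    return sqrt(sq / len(pairs))


def nearest_oligon(fragment, oligons):
    """Return (index, score) of the library entry closest to the fragment."""
    scores = [drmsd(fragment, o) for o in oligons]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy library of two hypothetical length-5 entries (made-up coordinates,
# spaced 3.8 units apart to mimic consecutive C-alpha atoms).
extended = [(3.8 * i, 0.0, 0.0) for i in range(5)]
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0),
        (0.0, 3.8, 0.0), (0.0, 3.8, 3.8)]
library = [extended, bent]

# A translated copy of the extended fragment should map back to entry 0.
query = [(x + 10.0, y - 2.0, z + 1.0) for x, y, z in extended]
index, score = nearest_oligon(query, library)
```

A distance-matrix measure cannot distinguish a fragment from its mirror image, so a measure based on optimal superposition may be preferable where chirality matters.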