The paper discusses the expansion of the Gigafida corpus, a Slovenian reference corpus, to include Internet content, i.e. web pages and user-generated
content (tweets, blogs, forums and comments on news portals). It reviews the
available resources and tools best suited to this objective and presents the
web crawling methodology used.