Algorithms for detecting and classifying duplicate documents and questions, implemented as part of a project.