Superfastmatch
A text comparison tool

Donovan Hide
September 2011
http://github.com/mediastandardstrust/superfastmatch
http://groups.google.com/group/superfastmatch

Browser Extension Specification

A Chrome Extension that will use the Readability script to extract the substance of a news article for a white-listed set of sites, eg:

Use Murmur Hash to create ordered integer representation of news article. Use SHA-256 to create a signature for the hashes. Send the hashes and signature to Superfastmatch server. Do a search and association using the hashes and store only the results against the signature. Return the results to the extension. Future users who see the same content (not necessarily the same URL) will get the results from the cache rather than execute another search.

Browser Extension Specification II

The results (ie. press releases) will be formatted in the browser extension with statistics, much like a Churnalism.com result.

A Firefox addon should be straight-forward once the Chrome version is complete

Scraper Specification

Initially 5 scrapers for major sources of press releases in the US. Written in Python, they will extract the publication date, url and title and content from each document and POST it to the Superfastmatch server. The press releases will be searchable within a matter of seconds.

Potentially the scrapers can check if content has been removed and DELETE the document from the index. Requires that state be stored somewhere.

Would recommend at least one Science scraper of a source such as Eureka Alert - this is an area that raises a lot of concerns in journalism.

Superfastmatch Specification

A C++ daemon that stores documents in memory-efficient form for fast searching and ranking according to degree of duplication between search text and document text. Results are presented in faceted form according to document type, which allows for cross-corpora search.

Leverages hash collisions as a means of increasing the compression potential through the side-effect of shorter document id deltas. Uses signal processing concepts to filter out collisions at search time.

Multi-threaded indexing means that a thread is assigned a slot of the hash table and works on it independently without a shared lock. Allows for scaling in line with number of processor cores.

Asynchronous queued document submission means scrapers don't have to wait for indexing to occur

Superfastmatch Specification II

Configurable hash width means the more memory that is available to the server, the less collisions that will occur. This increases the speed of search at the cost of memory usage. Also means that lesser-spec machines can run the index with a smaller hash width.

Requires experimentation to find correct width for a corpus. Hash width can be changed without reloading the documents.

Whole index is stored in memory, which gives massive speed boost. Currently memory/corpora size ratio is about 2:1. due to Variable Length Integer Coding. Much higher compression is available if a frequency distribution of doc id deltas is used to create a Huffman-type tree.

Superfastmatch dependencies

REST Specification

URL METHOD DESCRIPTION RESPONSE
/ GET Returns home page with search form 200
/document/ GET Returns the first page of documents 200
/document/?cursor=<cursor> GET Returns the page of documents beginning with <cursor> 200
/document/<doctype>/ GET Returns the first page of documents with <doctype>> 200
/document/<doctype>/?cursor=<cursor> GET Returns the page of documents with <doctype> beginning with <cursor> 200
/document/<doctype>/<docid>/ GET Returns the document with <doctype> and <docid> if exists 200/404/304
/document/<doctype>/<docid>/ POST Create or update the document with <doctype> and <docid> and defer association 202
/document/<doctype>/<docid>/ PUT Create or update the document with <doctype> and <docid> and associate 202
/document/<doctype>/<docid>/ DELETE Delete the document with <doctype> and <docid> and related associations 204/404
/index/ GET Returns the first page of the index 200
/index/?cursor=<cursor> GET Returns the page of the index beginning with <cursor> 200

REST Specification Part II

URL METHOD DESCRIPTION RESPONSE
/association/ GET Returns the first page of associations 200
/association/?cursor=<cursor> GET Returns the page of associations beginning with <cursor> 200
/association/<doctype>/ GET Returns the the first page of associations for <doctype> 200
/association/<doctype>/?cursor=<cursor> GET Returns the the page of associations for <doctype> beginning with <cursor> 200
/association/<doctype>/<docid>/ GET Returns the associations for a specific document 200/304
/association/ POST Updates associations for all documents 202
/association/<doctype>/ POST Updates associations for all documents with <doctype> 202
/association/<doctype1>/<doctype2>/ POST Updates associations for all documents with <doctype1> against <doctype2> 202
/search/ POST Returns the results for a search of all documents 200
/search/<doctype>/ POST Returns the results for a search against documents of <doctype> 200
/status/ GET Returns counts for each DB 200

REST Specification Part III

URL METHOD DESCRIPTION RESPONSE
/queue/ GET Returns the first page of jobs 200
/queue/?cursor=<cursor> GET Returns the page of jobs beginning with <cursor> 200
/queue/<queueid>/ GET Returns the details of <queueid> 200/404/304

Where <doctype>, <docid> and <queueid> are integers between 1 and 4294967295 ie. (all unsigned 32 bit numbers, excluding 0)