A text comparison tool
A Chrome Extension that will use the Readability script to extract the substance of a news article for a white-listed set of sites, eg:
Use Murmur Hash to create ordered integer representation of news article. Use SHA-256 to create a signature for the hashes. Send the hashes and signature to Superfastmatch server. Do a search and association using the hashes and store only the results against the signature. Return the results to the extension. Future users who see the same content (not necessarily the same URL) will get the results from the cache rather than execute another search.
The results (ie. press releases) will be formatted in the browser extension with statistics, much like a Churnalism.com result.
A Firefox addon should be straight-forward once the Chrome version is complete
Initially 5 scrapers for major sources of press releases in the US. Written in Python, they will extract the publication date, url and title and content from each document and POST it to the Superfastmatch server. The press releases will be searchable within a matter of seconds.
Potentially the scrapers can check if content has been removed and DELETE the document from the index. Requires that state be stored somewhere.
Would recommend at least one Science scraper of a source such as Eureka Alert - this is an area that raises a lot of concerns in journalism.
A C++ daemon that stores documents in memory-efficient form for fast searching and ranking according to degree of duplication between search text and document text. Results are presented in faceted form according to document type, which allows for cross-corpora search.
Leverages hash collisions as a means of increasing the compression potential through the side-effect of shorter document id deltas. Uses signal processing concepts to filter out collisions at search time.
Multi-threaded indexing means that a thread is assigned a slot of the hash table and works on it independently without a shared lock. Allows for scaling in line with number of processor cores.
Asynchronous queued document submission means scrapers don't have to wait for indexing to occur
Configurable hash width means the more memory that is available to the server, the less collisions that will occur. This increases the speed of search at the cost of memory usage. Also means that lesser-spec machines can run the index with a smaller hash width.
Requires experimentation to find correct width for a corpus. Hash width can be changed without reloading the documents.
Whole index is stored in memory, which gives massive speed boost. Currently memory/corpora size ratio is about 2:1. due to Variable Length Integer Coding. Much higher compression is available if a frequency distribution of doc id deltas is used to create a Huffman-type tree.
|/||GET||Returns home page with search form||200|
|/document/||GET||Returns the first page of documents||200|
|/document/?cursor=<cursor>||GET||Returns the page of documents beginning with <cursor>||200|
|/document/<doctype>/||GET||Returns the first page of documents with <doctype>>||200|
|/document/<doctype>/?cursor=<cursor>||GET||Returns the page of documents with <doctype> beginning with <cursor>||200|
|/document/<doctype>/<docid>/||GET||Returns the document with <doctype> and <docid> if exists||200/404/304|
|/document/<doctype>/<docid>/||POST||Create or update the document with <doctype> and <docid> and defer association||202|
|/document/<doctype>/<docid>/||PUT||Create or update the document with <doctype> and <docid> and associate||202|
|/document/<doctype>/<docid>/||DELETE||Delete the document with <doctype> and <docid> and related associations||204/404|
|/index/||GET||Returns the first page of the index||200|
|/index/?cursor=<cursor>||GET||Returns the page of the index beginning with <cursor>||200|
|/association/||GET||Returns the first page of associations||200|
|/association/?cursor=<cursor>||GET||Returns the page of associations beginning with <cursor>||200|
|/association/<doctype>/||GET||Returns the the first page of associations for <doctype>||200|
|/association/<doctype>/?cursor=<cursor>||GET||Returns the the page of associations for <doctype> beginning with <cursor>||200|
|/association/<doctype>/<docid>/||GET||Returns the associations for a specific document||200/304|
|/association/||POST||Updates associations for all documents||202|
|/association/<doctype>/||POST||Updates associations for all documents with <doctype>||202|
|/association/<doctype1>/<doctype2>/||POST||Updates associations for all documents with <doctype1> against <doctype2>||202|
|/search/||POST||Returns the results for a search of all documents||200|
|/search/<doctype>/||POST||Returns the results for a search against documents of <doctype>||200|
|/status/||GET||Returns counts for each DB||200|
|/queue/||GET||Returns the first page of jobs||200|
|/queue/?cursor=<cursor>||GET||Returns the page of jobs beginning with <cursor>||200|
|/queue/<queueid>/||GET||Returns the details of <queueid>||200/404/304|
Where <doctype>, <docid> and <queueid> are integers between 1 and 4294967295 ie. (all unsigned 32 bit numbers, excluding 0)