Our Web Crawler and Analyzer (WeCA) allows you to amass and categorise vast amounts of data, fast
Shore Tech Systems provided expert advice and custom software for a web crawling, harvesting and data-mining solution. The system was capable of collecting over 1.5 million documents in a 36 hour period (on a single server configuration); this resulted in an MS-SQL database of 1TB in size.
The brief was to collect a subset of the data publicly available on the World Wide Web and store it, along with all its relationships, in a manageable and transportable form.
The development process for this project was guaranteed to be highly iterative, as the client required a Proof of Concept application in order to build a general understanding of the data available and its potential quality. After successfully proving that the client's initial ideas could be realised, the system blueprint, and then the specification, took shape:
The project was initially divided into two distinct sections: the design and implementation of a harvester, and the design and implementation of the relational database schema.
With the prospect of having to deal with many hundreds of gigabytes of data per day, the harvester design had to overcome bottlenecks in numerous locations. Dividing the harvester's tasks into smaller work modules enabled the harvester to be fully multithreaded, with the option of splitting its different workloads across multiple servers if required.
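The idea of small work modules running on their own threads can be sketched as follows. This is a minimal illustration in Python, not the original harvester: the `fetch` and `parse` stages, the queue layout and the worker counts are all hypothetical stand-ins.

```python
import queue
import threading

SENTINEL = None  # marks the end of a work stream (hypothetical convention)


def fetch(url):
    """Hypothetical fetch module: stands in for an HTTP download."""
    return f"<html>{url}</html>"


def parse(page):
    """Hypothetical parse module: stands in for link and text extraction."""
    return {"length": len(page)}


def stage(work, inbox, outbox, n_workers):
    """Start n_workers threads that apply `work` to items from `inbox`."""
    def loop():
        while True:
            item = inbox.get()
            if item is SENTINEL:
                break
            outbox.put(work(item))

    threads = [threading.Thread(target=loop) for _ in range(n_workers)]
    for t in threads:
        t.start()
    return threads


def run_pipeline(urls, n_workers=4):
    """Push URLs through the fetch and parse modules and collect the results."""
    to_fetch, to_parse, done = queue.Queue(), queue.Queue(), queue.Queue()
    fetchers = stage(fetch, to_fetch, to_parse, n_workers)
    parsers = stage(parse, to_parse, done, n_workers)

    for url in urls:
        to_fetch.put(url)
    # One sentinel per worker shuts each stage down once its input is drained.
    for _ in fetchers:
        to_fetch.put(SENTINEL)
    for t in fetchers:
        t.join()
    for _ in parsers:
        to_parse.put(SENTINEL)
    for t in parsers:
        t.join()

    results = []
    while not done.empty():
        results.append(done.get())
    return results
```

Because each module only talks to its neighbours through a queue, a module could in principle be moved to another server by swapping the in-process queue for a networked one.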
Major bottlenecks encountered were:
As the iterative process with the client continued, and it became clear how much data could be gathered about each document and the relationships between them, the size of the database schema increased drastically. The end result was a database with over 50 objects, many of which required updating for each new document added to the data repository.
The tie between the harvester and the data repository was a layer initially written entirely in .NET. However, after the first round of system testing it was clear that the high number of transactions per second was saturating the initial implementation. This prompted the move to encapsulate as much as possible within stored procedures and to slim all pre-storage processing down to a minimum. The result of this continual iterative process was a tenfold increase in the number of documents stored per second, and the limiting bottleneck moved from the processor to the disk write speed of the underlying hardware.
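The principle of doing all the writes for one document in a single database-side unit of work can be sketched as below. This is an analogy only: it uses Python with SQLite, which has no stored procedures, so one function wrapping one transaction stands in for a stored procedure call. The three-table schema and the `store_document` name are hypothetical and far smaller than the real schema.

```python
import sqlite3

# Hypothetical miniature schema; the real repository had over 50 objects.
SCHEMA = """
CREATE TABLE documents (id INTEGER PRIMARY KEY, url TEXT UNIQUE, body TEXT);
CREATE TABLE links (src_id INTEGER, target_url TEXT);
CREATE TABLE host_stats (host TEXT PRIMARY KEY, doc_count INTEGER NOT NULL);
"""


def store_document(conn, url, host, body, links):
    """Store a document and update every dependent table in one transaction."""
    with conn:  # commits once on success, rolls back if any statement fails
        cur = conn.execute(
            "INSERT INTO documents (url, body) VALUES (?, ?)", (url, body)
        )
        doc_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO links (src_id, target_url) VALUES (?, ?)",
            [(doc_id, target) for target in links],
        )
        updated = conn.execute(
            "UPDATE host_stats SET doc_count = doc_count + 1 WHERE host = ?",
            (host,),
        )
        if updated.rowcount == 0:  # first document seen for this host
            conn.execute(
                "INSERT INTO host_stats (host, doc_count) VALUES (?, 1)", (host,)
            )
    return doc_id
```

The caller makes one call per document and pays for one commit, rather than issuing a separate transaction for each table that needs updating.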
The client's specification included the ability to manipulate the data in a number of ways from MatLab. There were two options for integrating the data with MatLab: using the MatLab database toolbox to interface with the database directly, or creating a MatLab compiled library that .NET can reference and whose procedures it can call.
By creating MatLab compiled code, the data could be manipulated from the managed .NET programming environment. This made it possible to create Graphical User Interfaces (GUIs) that could display the wealth of information gathered in an informative and structured manner.
The client also requested the ability to work with the data directly in MatLab. The key to this was generating a small wrapper of MatLab class files that exposed the underlying data in such a way that future analysis would need minimal configuration.
Extracting data from a database that grows this large this quickly becomes increasingly tricky unless the database is designed to retrieve data efficiently (without too seriously impacting the total harvest time).
A basic linear classifier, implemented directly in T-SQL to scan the entire dataset and retrieve specific information, ran in under 10 seconds, which was considerably better than the client was expecting (somewhere in the range of 25 to 30 minutes).
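To show why a linear classifier suits a set-based query language, here is a minimal sketch. It is not the project's T-SQL: it uses Python with SQLite, and the `features` and `weights` tables, the feature names and the bias value are all hypothetical. The score for each document is a weighted sum, which maps directly onto a join plus `SUM` and `GROUP BY`.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with made-up features and weights."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE features (doc_id INTEGER, feature TEXT, value REAL);
        CREATE TABLE weights (feature TEXT PRIMARY KEY, weight REAL);
        """
    )
    conn.executemany(
        "INSERT INTO weights VALUES (?, ?)",
        [("price", 2.0), ("spam", -3.0)],
    )
    conn.executemany(
        "INSERT INTO features VALUES (?, ?, ?)",
        [
            (1, "price", 1.0), (1, "spam", 0.0),
            (2, "price", 0.5), (2, "spam", 1.0),
            (3, "price", 1.0), (3, "spam", 0.5),
        ],
    )
    conn.commit()
    return conn


def classify(conn, bias):
    """Return (doc_id, score) for documents whose linear score is positive."""
    return conn.execute(
        """
        SELECT f.doc_id, SUM(f.value * w.weight) + ? AS score
        FROM features AS f
        JOIN weights AS w ON w.feature = f.feature
        GROUP BY f.doc_id
        HAVING SUM(f.value * w.weight) + ? > 0
        ORDER BY f.doc_id
        """,
        (bias, bias),
    ).fetchall()
```

Calling `classify(build_demo_db(), -0.25)` keeps documents 1 and 3 and rejects document 2. The whole dataset is scored in one pass inside the database engine, so no rows need to be shipped to a client application first.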
The exact nature of the project cannot be discussed but we are ready to utilise the technology developed for more extensive commercial use.