
WeCA

Our Web Crawler and Analyser (WeCA) allows you to amass and categorise vast amounts of data, fast.

Web Crawler and Analyser

  • Initial Tests collected and categorised 1.6 million documents in 36 hours

  • Diverse Technical Skills Used

  • Client

    Sys Consulting Ltd et al
  • Overview

    Shore Tech Systems provided expert advice and custom software for a web crawling, harvesting and data-mining solution.
  • Technologies Used

    Database design and architecture; custom web crawler; custom file parser; generating PDF files for reporting and exporting data; generating Excel files for reporting and exporting data

Shore Tech Systems provided expert advice and custom software for a web crawling, harvesting and data-mining solution. The system was capable of collecting over 1.5 million documents in a 36-hour period (on a single server configuration); this resulted in an MS-SQL database of 1 TB in size.

The brief was to be able to collect a subset of the data available publicly on the World Wide Web and store it and all its relationships in a manageable and transportable manner.

Proof of Concept

The development process for this project was guaranteed to be highly iterative, as the client demanded a Proof of Concept application in order to build a general understanding of the data available and its potential quality. After the Proof of Concept showed that the client's initial ideas could be realised, the system blueprint, and then the specification, took shape:

  • Harvest data from the WWW
  • Build relationships and paths between the data
  • Relational Database as a data repository
  • Examine data patterns from within .NET
  • MATLAB integration with .NET to expose the ability to further examine data patterns
  • MATLAB integration directly with the relational database to perform large database operations
  • Ability to retrieve data and patterns in a timely fashion from the data repository
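As an illustration of the first two items, the sketch below extracts hyperlinks from harvested pages, records them as relationships, and finds a path between two documents. It is not the project's code: the names `LinkExtractor`, `build_graph` and `shortest_path`, the in-memory dictionary and the example URLs are all assumptions made for the example, and the real system stored its relationships in a relational database.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute target of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def build_graph(pages):
    """Map each page URL to the list of URLs it links to (hypothetical helper)."""
    graph = {}
    for url, html in pages.items():
        parser = LinkExtractor(url)
        parser.feed(html)
        graph[url] = parser.links
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search for the shortest link path, or None if unreachable."""
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None


# Made-up pages standing in for harvested documents.
pages = {
    "http://example.test/a": '<a href="/b">b</a>',
    "http://example.test/b": '<a href="c">c</a>',
    "http://example.test/c": "",
}
graph = build_graph(pages)
print(shortest_path(graph, "http://example.test/a", "http://example.test/c"))
```

At the scale described here the graph would not fit comfortably in memory, which is one reason to keep the relationships in the database and query them there.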

Harvest and Data Store

The project was initially divided into two distinct sections: the design and implementation of a harvester, and the design and implementation of the relational database schema.

Harvester

With the prospect of having to deal with many hundreds of gigabytes of data per day, the harvester design had to overcome bottlenecks in numerous locations. Dividing the harvester's tasks into smaller work modules enabled it to be fully multithreaded, with the option of splitting the different workloads across multiple servers if required.

Major bottlenecks encountered were:

  • Downloading files in a way to maximise the use of the available internet bandwidth
  • Managing queuing systems whilst keeping the memory footprint to a minimum
  • Processing files of different formats to provide a uniform way of viewing all data

Data Schema

As the iterative process with the client continued, and it became clear how much data could be gathered about each document and the relationships between them, the size of the database schema increased drastically. The end result was a database with over 50 objects, many of which required updating for each new document added to the data repository.

Data Access

The tie between the harvester and the data repository was a layer initially written entirely in .NET. However, after the first round of system testing it was clear that the high number of transactions per second was saturating the initial implementation. This prompted the move to encapsulate as much as possible within stored procedures and to slim all pre-storage processing down to a minimum. The result of this continual iterative process was a ten-fold increase in the number of documents stored per second, and the limiting bottleneck moved from the processor to the disk write speed of the underlying hardware.
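The general idea of reducing per-document overhead between the application and the database can be sketched as follows. This is not the stored-procedure approach described above: SQLite (used here only because it ships with Python) has no stored procedures, so the sketch swaps in a related technique, batching many inserts into one transaction. The `document` table, its columns and the `store_documents` helper are hypothetical.

```python
import sqlite3


def store_documents(conn, docs, batch_size=500):
    """Insert (url, size_bytes) rows in batches, one transaction per batch."""
    stored = 0
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        with conn:  # commits on success, rolls back if the batch fails
            conn.executemany(
                "INSERT INTO document (url, size_bytes) VALUES (?, ?)",
                batch,
            )
        stored += len(batch)
    return stored


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document ("
    "id INTEGER PRIMARY KEY, url TEXT, size_bytes INTEGER)"
)
# Made-up rows standing in for harvested documents.
docs = [("http://example.test/%d" % i, 1000 + i) for i in range(1200)]
print(store_documents(conn, docs))
```

Whether batching, stored procedures, or both give the larger gain depends on the database engine and workload; the figures quoted in this case study relate only to the MS-SQL implementation.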

Data Manipulation

The client's specification included the ability to manipulate the data in a number of ways from MATLAB. There are two options for integrating the data with MATLAB: using the MATLAB Database Toolbox to interface with the database directly, and creating a compiled MATLAB library that .NET can reference and whose procedures it can call.

.NET Manipulation

Compiling the MATLAB code allowed the data to be manipulated from the managed .NET programming environment. This made it possible to create Graphical User Interfaces (GUIs) that could display the wealth of information gathered in an informative and structured manner.

MATLAB Integration

The client requested the ability to work with the data directly in MATLAB. The key to this was generating a small wrapper of MATLAB class files that exposed the underlying data in such a way that future analysis would need minimal configuration.

Data Extraction

Extracting data from a database that grows this large this quickly becomes increasingly tricky unless the database is designed to retrieve data efficiently (without too seriously impacting the total harvest time).

A basic linear classifier, implemented directly in T-SQL to scan the entire dataset and retrieve specific information, ran in under 10 seconds, which was considerably better than the client was expecting (somewhere in the range of 25 to 30 minutes).
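To show what a linear classifier expressed as a set-based query can look like, here is a minimal sketch. It uses SQLite from Python rather than T-SQL, and the schema (`document_feature`, `weight`), the feature names, the weights and the bias are all invented for the example; the project's actual classifier and data cannot be shown. Each document's score is the weighted sum of its feature values plus a bias, and documents scoring above zero are returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE document_feature (doc_id INTEGER, feature TEXT, value REAL);
    CREATE TABLE weight (feature TEXT PRIMARY KEY, weight REAL);
    """
)
# Made-up feature values per document.
conn.executemany(
    "INSERT INTO document_feature VALUES (?, ?, ?)",
    [
        (1, "crawler", 2.0), (1, "database", 1.0),
        (2, "recipe", 3.0), (2, "database", 1.0),
        (3, "crawler", 2.0), (3, "recipe", 1.0),
    ],
)
# Made-up classifier weights.
conn.executemany(
    "INSERT INTO weight VALUES (?, ?)",
    [("crawler", 1.5), ("database", 0.5), ("recipe", -1.0)],
)


def classify(conn, bias):
    """Return ids of documents whose linear score (w . x + bias) is positive.

    Features with no row in `weight` are ignored by the join, and documents
    with no weighted features at all do not appear in the result.
    """
    rows = conn.execute(
        """
        SELECT f.doc_id
        FROM document_feature AS f
        JOIN weight AS w ON w.feature = f.feature
        GROUP BY f.doc_id
        HAVING SUM(f.value * w.weight) + :bias > 0
        ORDER BY f.doc_id
        """,
        {"bias": bias},
    ).fetchall()
    return [doc_id for (doc_id,) in rows]


print(classify(conn, bias=-1.0))  # prints [1, 3]
```

Because the whole computation is a join and an aggregate, the database engine can evaluate it in a single pass over indexed data instead of shipping every row to the application.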

Future Use

The exact nature of the project cannot be discussed, but we are ready to utilise the technology developed for more extensive commercial use.