Meet the team
crawler

I crawl websites according to crawl policies you define.

I'm adept enough to handle pretty much any crawl logic you'd want, for example:

  • only allow URLs that match a regex rule, or reject those on a blacklist
  • don't follow links which have a nofollow attribute
  • only crawl the first 10000 URLs I find
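
To make those three rules a little more concrete, here is a rough Python sketch. It is an illustration only, not my actual API: the `CrawlPolicy` and `LinkParser` names, the example pattern and the URLs are all made up for this example.

```python
import re
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects href values from <a> tags, skipping rel="nofollow" links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, or be missing entirely
        rel = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if href and "nofollow" not in rel:
            self.links.append(href)


class CrawlPolicy:
    """Hypothetical policy object (an assumption, not Flexile's API)."""

    def __init__(self, allow_pattern, blacklist=(), max_urls=10000):
        self.allow = re.compile(allow_pattern)
        self.blacklist = set(blacklist)
        self.max_urls = max_urls
        self.accepted = 0

    def accept(self, url):
        # Stop once the URL budget is used up
        if self.accepted >= self.max_urls:
            return False
        # Reject blacklisted URLs and anything the regex does not match
        if url in self.blacklist or not self.allow.search(url):
            return False
        self.accepted += 1
        return True


def extract_links(html):
    """Return the followable links found in an HTML string."""
    parser = LinkParser()
    parser.feed(html)
    return parser.links
```

A crawl loop would call `extract_links()` on each fetched page and queue only the URLs for which `policy.accept(url)` returns `True`.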

crawls the websites you specify

extracter

Born from the work done for Homethinking, I am an XML-based data manipulation framework.

I help you map between datasources using regex rules. I have all kinds of validation and conversion goodness you can use to massage your data.

A datasource, in this case, can be an HTML file, an XML file, a database, or even a Solr index.

In the typical vertical search use-case, data is extracted from an HTML file into a database.

extracts data from URLs into a database

searcher

I'm powered by Apache Solr. Nuff said.

Used by a number of large consumer internet destinations, Solr is fast, robust and extensible.

Additionally, I'm pimped out with custom Solr request handlers which give you goodies like:

  • spatial searching
  • query autocompletion
  • query expansion

tricked-out version of Apache Solr

Together we provide the complete backend infrastructure required to create and run a vertical search engine.

Why it's cool
Faster
Flexile allows you to launch a vertical search engine in 3-6 weeks instead of as many months.


Safer
The infrastructure is tried-and-tested. Currently in production in a number of products.


Cheaper
Flexible licensing options, from yearly subscriptions to upfront royalty-free licenses.


Focus
You get to focus on what matters most: the front-end user experience.

Who is it for?

Flexile is designed with one type of person in mind:

A business-savvy individual wanting to rapidly launch a vertical search engine without undertaking the risky process of hiring programmers to build the backend from scratch.

Well, maybe this type of person might find it useful too...

An existing company/team with front-end PHP/ASP/Ruby/Python chops wanting to launch a product which requires crawling and searching.

The Secret Recipe

First, we determine which websites need to be crawled.

Then, the crawl policy for each website (or a group of websites) is defined using Flexile's XML-based crawl schema.

Elements of the crawl schema include:

  • Seed URLs - where the crawler should look to obtain the list of URLs to crawl, e.g. text file, database etc
  • Parse rules - how the crawler should discover new links from crawled URLs, e.g. all links in HTML, regex patterns etc
  • URL filters - how the crawler should determine which URLs to crawl, e.g. same host, regex patterns, per-host limits etc
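
As a loose illustration of how seed URLs and URL filters fit together, here is a small Python sketch. It does not show Flexile's XML crawl schema. The `seeds_from_text`, `HostFilter` and `build_filter` names are hypothetical, and the sketch assumes a text-file seed list with a same-host rule and a per-host limit.

```python
from collections import Counter
from dataclasses import dataclass, field
from urllib.parse import urlparse


def seeds_from_text(text):
    """Seed URLs: one URL per line, ignoring blank lines and # comments."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


@dataclass
class HostFilter:
    """URL filter: stay on the allowed hosts and cap the URLs taken per host."""
    allowed_hosts: set
    per_host_limit: int
    seen: Counter = field(default_factory=Counter)

    def accept(self, url):
        host = urlparse(url).netloc.lower()
        if host not in self.allowed_hosts:
            return False  # "same host" rule
        if self.seen[host] >= self.per_host_limit:
            return False  # per-host limit reached
        self.seen[host] += 1
        return True


def build_filter(seeds, per_host_limit):
    """Derive the allowed hosts from the seed list."""
    hosts = {urlparse(seed).netloc.lower() for seed in seeds}
    return HostFilter(hosts, per_host_limit)
```

Swapping `seeds_from_text` for a function that reads from a database would change where the seeds come from without touching the filter.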

For each of the crawl components, such as URL filtering, parsing, URL normalization etc, we'll either use stock Flexile components or code a custom implementation for your needs.

The extraction phase is only needed when metadata has to be extracted from the HTML files.

For example, if you're building a job vertical search engine, we'll define extraction rules for fields such as:

  • Title
  • Salary
  • Location
  • Company

Flexile provides a simple and straightforward way to map regex patterns to metadata columns.

Additionally, transformers and validators can be applied to individual columns before outputting them to a database or to Solr.
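
The idea can be sketched in a few lines of Python. This is only an approximation of the concept, not Flexile's rule format: the regex patterns, the HTML they assume, and the `RULES`, `TRANSFORMERS`, `VALIDATORS` and `extract` names are all invented for the example.

```python
import re

# Hypothetical extraction rules for a job listing page: column -> regex
RULES = {
    "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
    "salary": re.compile(r"Salary:\s*\$([\d,]+)"),
    "location": re.compile(r'<span class="location">(.*?)</span>', re.S),
}

# Transformers clean up or convert the raw matched text
TRANSFORMERS = {
    "title": str.strip,
    "salary": lambda s: int(s.replace(",", "")),
    "location": str.strip,
}

# Validators decide whether a transformed value is acceptable
VALIDATORS = {
    "title": lambda v: len(v) > 0,
    "salary": lambda v: v > 0,
}


def extract(html):
    """Return (row, errors) for one HTML document."""
    row, errors = {}, []
    for column, pattern in RULES.items():
        match = pattern.search(html)
        if match is None:
            errors.append(f"{column}: no match")
            continue
        value = TRANSFORMERS.get(column, lambda v: v)(match.group(1))
        if not VALIDATORS.get(column, lambda v: True)(value):
            errors.append(f"{column}: failed validation")
            continue
        row[column] = value
    return row, errors
```

The resulting `row` dictionary is what would then be written to the database or indexed in Solr; rows with errors can be logged for review instead.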

The two main parts of Solr that need to be configured are the schema.xml and solrconfig.xml files.

The fields defined in the schema will depend on your vertical search app.

As part of the initial customization phase, we'll determine your needs and make the necessary configuration.

If custom Solr components need to be written, they'll be declared in solrconfig.xml.

Flexile also supports spatial searches if required by your app.

Crawls can be launched manually or via cron jobs.

At the end of each crawl run, Flexile can write a crawl summary to a MySQL database.

This bit we won't help you with.

We'll be happy to work with your front-end UI wizards to integrate with the searcher and crawler.

Flexile imposes no restrictions on your choice of front-end language or framework.

Popular choices include Ruby, PHP, ASP, Python and JSP/Servlet.

Through Apache Solr, Flexile exposes an HTTP REST interface to your front-end app.

Supported data formats include XML, JSON, binary Java, serialized PHP, etc.
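
For a feel of what the front-end side might look like, here is a minimal Python sketch that builds a Solr query URL and reads a JSON response. `q`, `fq`, `start`, `rows` and `wt` are standard Solr query parameters. The base URL (`http://localhost:8983/solr`), the `title` and `location` field names, and the helper function names are assumptions that depend on your own setup and schema.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, query, rows=10, start=0, filters=()):
    """Build a URL for Solr's select handler, asking for a JSON response."""
    params = [("q", query), ("start", start), ("rows", rows), ("wt", "json")]
    # Each filter query narrows the result set without affecting scoring
    params += [("fq", f) for f in filters]
    return f"{base_url}/select?{urlencode(params)}"


def parse_docs(body):
    """Return (total hit count, list of documents) from a Solr JSON response."""
    data = json.loads(body)
    return data["response"]["numFound"], data["response"]["docs"]
```

The URL can be fetched with any HTTP client, for example `urllib.request.urlopen(url).read()`, and the body passed to `parse_docs()`.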

Requirements
  • Linux
  • MySQL
  • Sun Java 6
  • Front-end UI developer(s)
  • Dedicated Crawl Master
  • System administrator

Flexile runs off your hosting infrastructure. It is NOT a hosted service.

FAQ

Vertical what?

A vertical search engine, as opposed to a web search engine, provides search results on a particular vertical slice of the internet.

It usually provides a richer search experience because it has privileged knowledge of its domain.

Examples of vertical search engines you've probably already used are Zillow.com (property), Indeed.com (jobs) and YellowPages.com (business listings).

Can Flexile only be used for vertical search?

Nope.

Previous uses of the Flexile crawler include:

  • performing Google keyword competitive analysis
  • determining domain reputation using PageRank, AlexaRank and other metrics
  • determining all known trade/manufacturer names of a given biomedical active ingredient

Previous uses of the Flexile extracter include:

  • building a comprehensive database of European Commission EPAR (drug approvals)
  • importing PubMed/Medline XML data to a MySQL database
  • extracting USPO post office locations and contact information

Is it customizable?

Hell yes!

Unlike most other crawlers, Flexile was built to be embeddable and completely customizable. Every part of the crawl, extract and search phases can be customized.

The Flexile crawler exposes an Event API which programmers can use to hook into the crawl engine. Examples of crawl events are: beforeCrawl, urlQueued, urlFound, urlFetched, etc.

Through these events, programmers can write custom crawl components which customize the behavior of the crawler.
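
The pattern looks roughly like the Python sketch below. Only the event names (`beforeCrawl`, `urlFetched`) come from Flexile. The `CrawlEvents` dispatcher, its `on`/`fire` methods, the payload fields and the `FetchStats` component are invented here to illustrate the idea and are not Flexile's real API.

```python
from collections import defaultdict


class CrawlEvents:
    """Toy event dispatcher (hypothetical registration API)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event, handler):
        """Register a handler for a named crawl event."""
        self._handlers[event].append(handler)

    def fire(self, event, **payload):
        """Call every handler registered for the event, in order."""
        for handler in self._handlers[event]:
            handler(**payload)


class FetchStats:
    """Custom component: counts fetched URLs by HTTP status."""

    def __init__(self, events):
        self.by_status = defaultdict(int)
        events.on("urlFetched", self.record)

    def record(self, url, status):
        self.by_status[status] += 1
```

A crawl engine built this way would call `events.fire("urlFetched", url=..., status=...)` after each fetch, so components like `FetchStats` can be added or removed without changing the engine itself.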

How is Flexile different from Nutch?

Nutch is web-scale, and difficult to customize.

Flexile is not web-scale, but it is completely customizable. It has been known to scale to millions of URLs.

In Nutch:

  • There's no easy way to specify per-site crawl rules
  • There's no logic for SERP and detail URL crawling (a pattern which 99% of large database-driven sites fall under)
  • No pluggable component for seeding URLs other than from a text file, e.g. from a database or from incrementing counters
  • No way of specifying cookies or bypassing cookies altogether
  • No way of performing pre-crawl procedures, such as logging in to a site to obtain session cookies etc
  • No way of fetching javascript-based paging or link mechanisms

And the list kinda goes on.

Furthermore, with large sites like monster.com with a million-plus listings, you need to restrict the crawl to the bare minimum needed to reach the jobs instead of crawling the whole site. Otherwise, the turnaround time between crawls is too long and your search results will appear outdated.

Theoretically, with Hadoop and Nutch, any customization is possible. Practically, speak to anyone who's tried to change Nutch in a non-trivial way and you'll know why Flexile exists.

Can Flexile crawl site X?

The general rule of thumb is: any stateless RESTful URL can be crawled by Flexile.

Can Flexile crawl password-protected websites?

Yes! Flexile can log in to a site and obtain cookies before the crawl starts. These cookies will then be used for subsequent requests to the site.

It sounds very labor-intensive to define rules for each site. Can't this be automated?

Ah... the holy grail of screen-scraping, (semi-)automated crawling and extraction.

Sounds great... but Flexile is not there yet. If that's what you're looking for, you'll need to look elsewhere.

In our experience, a large number of the sites from which data needs to be extracted are non-standard, and automated information extraction (IE) techniques will probably fail on them. That said, IE is not our area of expertise.

All this sounds great! Is it open-source? Where can I download the binaries/source?

As much as we are proponents of open-source software, Flexile is, unfortunately, currently NOT open-source software.

The Flexile vertical search platform was built from the experience of launching products like Homethinking.com and Indeed.com, and is the only comprehensive vertical search platform available on the market for which the full source code is available for purchase.

Flexile is available on a yearly binary license, or a one-time perpetual source license. Please drop us a mail to inquire about pricing details. Generous discounts are available for academic institutions and non-profit corporations.

Custom crawling and data extraction

If you need a custom crawler built and/or some data extracted from websites, do get in touch with us. Building crawlers is our passion.

We've crawled the impossible and that makes us mighty!

Interested to find out more?

Drop us a mail to find out how Flexile can jump-start your vertical search project.