I crawl websites according to crawl policies you define.
I'm adept enough to handle pretty much any crawl logic you'd want, for example:
crawls the websites you specify
Born from the work done for Homethinking, I am an XML-based data manipulation framework.
I help you map between datasources using regex rules. I have all kinds of validation and conversion goodness you can use to massage your data.
A datasource, in this case, can be an HTML file, an XML file, a database, or even a Solr index.
In the typical vertical search use-case, data is extracted from an HTML file into a database.
extracts data from URLs into a database
I'm powered by Apache Solr. Nuff said.
Used by a number of large consumer internet destinations, Solr is fast, robust and extensible.
Additionally, I'm pimped out with custom Solr request handlers which give you goodies like:
tricked-out version of Apache Solr
Together we provide the complete backend infrastructure required to create and run a vertical search engine.
Flexile is designed with one type of person in mind:
Well, maybe this type of person might find it useful too...
First, we determine which websites need to be crawled.
Then, the crawl policy for each website (or a group of websites) is defined using Flexile's XML-based crawl schema.
Elements of the crawl schema include:
For each of the crawl components, such as URL filtering, parsing and URL normalization, we'll either use stock Flexile components or code a custom implementation for your needs.
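To give a feel for what a URL filtering and URL normalization component does, here is a minimal Python sketch. It is not Flexile code and does not use Flexile's crawl schema; the `ALLOW` and `DENY` rules, the `example.com` URLs and the function names are all assumptions made up for illustration.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical allow/deny rules for a job site; not Flexile's actual schema.
ALLOW = [re.compile(r"^https?://example\.com/jobs/")]
DENY = [re.compile(r"\.(?:jpg|png|css|js)$", re.IGNORECASE)]


def normalize(url):
    """Lowercase the scheme and host, drop the fragment, default the path to '/'."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        parts.query,
        "",  # fragments never reach the server, so they are discarded
    ))


def should_crawl(url):
    """A URL passes if it matches no deny rule and at least one allow rule."""
    if any(pattern.search(url) for pattern in DENY):
        return False
    return any(pattern.search(url) for pattern in ALLOW)
```

A crawler would typically normalize each discovered link first and then ask `should_crawl(normalize(link))` before queueing it.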
The extraction phase is only used when some metadata is required to be extracted from the HTML files.
For example, if you're building a job vertical search engine, we'll define extraction rules for fields such as:
Flexile provides a simple and straightforward way to map regex patterns to metadata columns.
Additionally, transformers and validators can be applied to individual columns before outputting them to a database or to Solr.
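The idea of mapping regex patterns to columns, then transforming and validating each value, can be sketched in a few lines of Python. This is only an illustration under assumptions: the `RULES`, `TRANSFORMERS` and `VALIDATORS` tables, the column names and the sample HTML are hypothetical and say nothing about how Flexile's own rules are written.

```python
import re

# Hypothetical rules: column name -> regex with exactly one capture group.
RULES = {
    "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
    "salary": re.compile(r"Salary:\s*\$([\d,]+)"),
}

# Optional per-column transformers and validators (illustrative only).
TRANSFORMERS = {
    "title": str.strip,
    "salary": lambda s: int(s.replace(",", "")),
}
VALIDATORS = {
    "salary": lambda n: n > 0,
}


def extract(html):
    """Apply each rule, then transform and validate the captured value."""
    row = {}
    for column, pattern in RULES.items():
        match = pattern.search(html)
        if match is None:
            row[column] = None  # pattern not found on this page
            continue
        value = TRANSFORMERS.get(column, lambda v: v)(match.group(1))
        is_valid = VALIDATORS.get(column, lambda v: True)(value)
        row[column] = value if is_valid else None
    return row
```

For example, `extract('<h1> Senior Engineer </h1><p>Salary: $120,000</p>')` yields a row with a trimmed title and an integer salary, ready to be written to a database or to Solr.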
The 2 main parts of Solr which need to be configured are the schema.xml and solrconfig.xml files.
The fields defined in the schema will depend on your vertical search app.
As part of the initial customization phase, we'll determine your needs and make the necessary configuration.
If custom Solr components need to be written, they'll be declared in solrconfig.xml.
Flexile also supports spatial searches if required by your app.
Crawls can be launched manually or via cron jobs.
At the end of each crawl run, Flexile can write a crawl summary to a MySQL database.
This bit we won't help you with.
We'll be happy to work with your front-end UI wizards to integrate with the searcher and crawler.
Flexile imposes no restrictions on your choice of front-end language or framework.
Popular choices include Ruby, PHP, ASP, Python and JSP/Servlet.
Through Apache Solr, Flexile exposes an HTTP REST interface to your front-end app.
Data formats supported include XML, JSON, binary Java (javabin), serialized PHP and others.
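As a rough idea of what talking to that interface looks like from a front-end, here is a small Python sketch that builds a standard Solr `select` URL and asks for JSON via the `wt` parameter. The host, port, core name (`jobs`) and the `title` field are assumptions; they depend entirely on your own installation and schema.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumed endpoint: host, port and core name depend on your Solr install.
SOLR_SELECT = "http://localhost:8983/solr/jobs/select"


def build_query_url(text, rows=10, fmt="json"):
    """Build a Solr select URL; 'wt' picks the response format."""
    params = {"q": text, "rows": rows, "wt": fmt}
    return SOLR_SELECT + "?" + urlencode(params)


def search(text, rows=10):
    """Run the query and return the decoded JSON response (needs a live Solr)."""
    with urllib.request.urlopen(build_query_url(text, rows)) as response:
        return json.load(response)
```

Any language with an HTTP client can do the same thing, which is why the choice of front-end framework is left open.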
Flexile runs off your hosting infrastructure. It is NOT a hosted service.
A vertical search engine, as opposed to a web search engine, provides search results on a particular vertical slice of the internet.
It usually provides a richer search experience because it has privileged knowledge of its domain.
Examples of vertical search engines you've probably already used are Zillow.com (property), Indeed.com (jobs) and YellowPages.com (business listings).
Nope.
Previous uses of the Flexile crawler include:
Previous uses of the Flexile extractor include:
Hell yes!
Unlike most other crawlers, Flexile was built to be embeddable and completely customizable. Every part of the crawl, extract and search phases can be customized.
The Flexile crawler exposes an Event API which programmers can use to hook into the crawl engine. Examples of crawl events are: beforeCrawl, urlQueued, urlFound, urlFetched, etc.
Through these events, programmers can write custom crawl components which customize the behavior of the crawler.
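To show the general shape of hooking into crawl events, here is a minimal Python sketch of an event registry. It borrows the event names mentioned above (`urlFound`, `urlFetched`), but the `CrawlEvents` class, its `on` and `fire` methods and the payload fields are hypothetical stand-ins, not Flexile's actual Event API.

```python
from collections import defaultdict


class CrawlEvents:
    """Minimal event registry; a stand-in, not Flexile's real Event API."""

    def __init__(self):
        # Maps an event name to the list of handlers registered for it.
        self._handlers = defaultdict(list)

    def on(self, event, handler):
        """Register a handler to be called whenever the event fires."""
        self._handlers[event].append(handler)

    def fire(self, event, **payload):
        """Call every handler registered for the event, in registration order."""
        for handler in self._handlers[event]:
            handler(**payload)


# Example custom component: collect every URL the crawler reports as found.
found_urls = []
events = CrawlEvents()
events.on("urlFound", lambda url: found_urls.append(url))
events.fire("urlFound", url="http://example.com/jobs/1")
events.fire("urlFetched", url="http://example.com/jobs/1")  # no handler, ignored
```

A custom component in this style is just a handler that reacts to the events it cares about and ignores the rest.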
Nutch is web-scale, and difficult to customize.
Flexile is not web-scale, but completely customizable. Flexile has been known to scale to millions of URLs.
In Nutch:
And the list kinda goes on.
Furthermore, with large sites like monster.com that have a million or more listings, you need to focus the crawl on the bare minimum of pages needed to reach the jobs instead of crawling the whole site. Otherwise, the turnaround time between crawls is too long and your search results will appear outdated.
Theoretically, with Hadoop and Nutch, any customization is possible. Practically, speak to anyone who's tried to change Nutch in a non-trivial way and you'll know why Flexile exists.
The general rule of thumb is: any stateless RESTful URL can be crawled by Flexile.
Yes! Flexile can log-in to a site and obtain cookies before the crawl starts. This cookie will then be used for subsequent requests to the site.
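The log-in-then-crawl pattern can be sketched with Python's standard library. This is not how Flexile itself is implemented; the form field names (`username`, `password`) and the URLs are assumptions, and real sites often need extra fields such as CSRF tokens.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login_request(login_url, username, password):
    """Build the login POST; the form field names are assumptions."""
    form = urllib.parse.urlencode({"username": username, "password": password})
    # Supplying a body makes urllib send this request as a POST.
    return urllib.request.Request(login_url, data=form.encode("utf-8"))
```

In use, you would call `opener.open(login_request(...))` once, and then fetch the pages to crawl through the same `opener` so the cookies set at log-in accompany every later request.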
Ah... the holy grail of screen-scraping, (semi-)automated crawling and extraction.
Sounds great... but Flexile is not there yet. If that's what you're looking for, you need to look elsewhere.
From experience, a large number of the sites that data needs to be extracted from are non-standard, and automated information extraction (IE) techniques will probably fail on them. However, IE is not our area of expertise.
As much as we are proponents of open-source software, Flexile is, unfortunately, currently NOT open-source software.
The Flexile vertical search platform was built from the experience of launching products like Homethinking.com and Indeed.com, and is the only comprehensive vertical search platform available on the market for which the full source code is available for purchase.
Flexile is available on a yearly binary license or a one-time perpetual source license. Please drop us a mail to inquire about pricing details. Generous discounts are available for academic institutions and non-profit corporations.
If you need a custom crawler built and/or some data extracted from websites, do get in touch with us. Building crawlers is our passion.
We've crawled the impossible and that makes us mighty!
Drop us a mail to find out how Flexile can jump-start your vertical search project.