August 22, 2017
The last couple of years have been years of data revolution. While 2015 set the platform and 2016 saw intrigue and early adoption, 2017 has seen data analytics penetrate from enterprises to SMBs and startups. The year ahead and the one after will see its penetration to the last mile. Every business today is thinking about leveraging data to steer ahead, and if the business needs to crawl web data, the question of whether to build or buy a web crawling solution is virtually a no-brainer.
Crawled web data has become a critical component of big data analytics operations, from enterprises to startups. Crawling the web and cleaning and structuring the data at scale poses its own complex challenges, from both legal and technical perspectives.
Many would argue that web crawling is an illegal and unethical practice, but that is a topic for another day (a US court recently allowed a startup to crawl LinkedIn’s publicly available data).
Technically, developing crawling capabilities in-house and maintaining them is not as easy as it sounds. Unless you’re doing it at a massive scale (at least gigabytes of data every day), building your own web crawling solution is neither efficient nor cost-effective.
You need to calculate the true cost of ownership of developing and maintaining an internal web crawling solution before you start building one.
In a world driven by technology, where more and more people and companies contribute to open source, the temptation to develop a proprietary web crawling solution is irresistible. Building a crawling framework from scratch, or modifying an existing one, means hiring at least one full-time employee or diverting an existing employee from current projects. The goal is to deliver a stable, robust solution with comprehensive coverage, granularity, and scalability, and that usually takes a couple of months and represents a substantial cost.
The web keeps changing, and web crawlers need constant updates. Requirements evolve, new data sources keep getting added, and the crawlers have to be continually redesigned to meet your specific needs. A full-time person is required just to maintain the crawling infrastructure.
The crawler has to run 24/7 on a dedicated machine. The crawled data has to be stored and processed, and the data or analytics made highly available. Depending on your requirements, this infrastructure can easily set you back hundreds of dollars.
If you’re not in the web crawling business, chances are the way your crawlers were built makes them hard to scale. Over time, you’ll have to add more crawlers, crawl more sources, extract more data points per crawl, filter content, and so on. Servers and databases have to be replicated, and more processes have to be scheduled and automated. Data cleaning and structuring alone is 70% of the work; indexing and data backups add to the job. Eventually, you’ll need to hire even more developers to build a solution robust enough to keep up with the dynamic nature of the web.
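To make that maintenance burden concrete, here is a minimal sketch of a single-page crawler, assuming Python with the requests and BeautifulSoup libraries. The URL and CSS selectors are hypothetical placeholders; notice how much of the code is cleaning and structuring rather than fetching, and how every selector breaks the moment the target site changes its markup.

```python
# A minimal single-page crawler sketch. The target URL and the CSS
# selectors ("div.listing", "span.price") are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

def crawl_page(url):
    """Fetch one page and return cleaned, structured records."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    records = []
    for item in soup.select("div.listing"):  # hypothetical selector
        title = item.select_one("h2")
        price = item.select_one("span.price")
        if title is None or price is None:
            continue  # malformed entry: "cleaning" means handling these
        records.append({
            "title": title.get_text(strip=True),
            # strip the currency symbol and thousands separators before parsing
            "price": float(price.get_text(strip=True).lstrip("$").replace(",", "")),
        })
    return records

print(crawl_page("https://example.com/listings"))  # placeholder URL
```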
It makes sense to rely on a vendor for efficient and scalable web data acquisition. Still, the cost could remain prohibitive even if it’s more affordable than developing the capability internally.
Many companies provide custom web crawling services. Make sure you decide on the value of the data before settling on a price for the crawler. You will have the option to buy the web crawling scripts and then run and maintain them yourself, or to just buy the data. Also make sure you get a structured stream of data delivered to your database or application.
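As a rough illustration of what “a structured stream of data delivered to your database” can look like, here is a minimal sketch that loads a vendor-delivered JSON feed into SQLite. The file name and record fields are hypothetical; in practice, the schema would be agreed with the vendor.

```python
# A minimal sketch of loading a vendor-delivered JSON feed into SQLite.
# The file name and record fields ("name", "url", "crawled_at") are
# hypothetical; a real feed's schema comes from the vendor's contract.
import json
import sqlite3

conn = sqlite3.connect("crawled_data.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS companies (
        name TEXT,
        url TEXT,
        crawled_at TEXT
    )
""")

with open("vendor_feed.json") as f:
    records = json.load(f)

conn.executemany(
    "INSERT INTO companies (name, url, crawled_at) VALUES (?, ?, ?)",
    [(r["name"], r["url"], r["crawled_at"]) for r in records],
)
conn.commit()
conn.close()
```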
Market research firms, data providers, and web crawling service providers also sell pre-crawled datasets to companies and individuals.
Using an API enables you to consume custom data streams and filtered data points. The API provider has an engineering team that develops the crawlers and maintains their stability and coverage. API providers typically crawl high-demand data at scale and sell it to multiple customers, which brings prices down. There are tiers with segmented data access, as well as pay-per-use plans, so you can choose one that fits your budget.
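In practice, consuming such an API usually amounts to an authenticated HTTP request with filter parameters. The sketch below is hypothetical: the endpoint, parameter names, header, and response shape are placeholders, not any particular vendor’s actual contract.

```python
# A hypothetical example of pulling a filtered data stream from a
# crawling-API vendor. The endpoint, parameters, and response shape are
# placeholders -- consult the actual vendor's documentation.
import requests

API_KEY = "your-api-key"  # issued per tier or pay-per-use plan
response = requests.get(
    "https://api.example-vendor.com/v1/articles",  # placeholder endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={
        "source": "news",                     # restrict to a data source
        "published_after": "2017-08-01",      # filter the data points
        "fields": "title,url,published_at",   # request only what you need
        "limit": 100,
    },
    timeout=10,
)
response.raise_for_status()
for article in response.json()["results"]:  # assumed response shape
    print(article["title"], article["url"])
```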
Depending on the complexity of the scraping and the depth of the data, the datasets may be cheap or costly. The seller might offer one-time upfront pricing, annual API charges, or a pay-per-use plan.
Many companies endure the painful process of selecting the right web crawling service, only to discover that the cost of support and inadequate response times make the entire endeavor impractical.
Here’s what you can do with the StartupFlux APIs (demo coming soon):
Build powerful applications or integrate StartupFlux into your systems, processes, and web and mobile applications with the REST API. The StartupFlux API is a read-only RESTful service that enables developers to leverage the same data that powers https://startupflux.com
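Since the demo is still to come, the exact endpoints are not documented here; the sketch below only illustrates the rough shape a read-only REST call could take, with a made-up endpoint, query parameter, and response format.

```python
# Illustrative only: the endpoint path, query parameter, and response
# fields below are made up, since the StartupFlux API demo is not yet
# published. A read-only REST call would look roughly like this.
import requests

response = requests.get(
    "https://api.startupflux.com/v1/companies",  # hypothetical endpoint
    headers={"Authorization": "Bearer your-api-key"},
    params={"query": "fintech"},  # hypothetical search parameter
    timeout=10,
)
response.raise_for_status()
for company in response.json().get("results", []):  # assumed shape
    print(company)
```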