Agex

How Web Crawlers Work

A web crawler (also known as a web spider or web robot) is a program or automated script that browses the internet looking for web pages to process.

Many applications, mostly search engines, crawl websites every day in order to find up-to-date data. Most web crawlers save a copy of each visited page so they can index it later. The rest scan pages for a narrower purpose only, such as harvesting email addresses (for spam).

How does it work?

A crawler needs a starting point, which is a web address (a URL).

To browse the internet we use the HTTP network protocol, which lets us talk to web servers and download data from them or upload data to them.

The crawler downloads the page at this URL and then looks for hyperlinks (the A tag in HTML).

Then the crawler visits those links and carries on in the same way.
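
The loop described above can be sketched in Python roughly as follows. This is a minimal illustration, not a production crawler: the names LinkExtractor, fetch_over_http and crawl, the max_pages limit and the breadth-first visiting order are all assumptions made for the example. It also ignores politeness rules such as robots.txt and rate limiting, which a real crawler should respect.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every A tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_over_http(url):
    """Download a page over HTTP and return its body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_over_http, max_pages=50):
    """Visit pages breadth-first from start_url; return the URLs visited, in order."""
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop the #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Calling crawl("http://www.example.com/") would start from that (placeholder) address. The fetch parameter exists so the download step can be swapped out, for instance with a function that reads pages from a local dictionary when testing.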

That is the basic idea. How we proceed from here depends entirely on the purpose of the software itself.

If we only want to grab email addresses, then we would search the text of each web page (including its hyperlinks) for them. This is the easiest type of software to develop.
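
As a rough illustration of why this case is simple, the search can be done with a single regular expression over the raw page text. The pattern and the function name find_emails below are assumptions for this sketch. The pattern is deliberately loose and will not match every valid address or reject every invalid one.

```python
import re

# A deliberately simple pattern: local part, "@", then a dotted domain name.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_emails(page_text):
    """Return the unique email-like strings in page_text, in order of appearance."""
    found = []
    for match in EMAIL_PATTERN.findall(page_text):
        address = match.lower()
        if address not in found:
            found.append(address)
    return found
```

Because the function scans the raw HTML source, it also picks up addresses that appear only inside hyperlinks, such as mailto: links.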

Search engines are much more difficult to develop.

When building a search engine we need to take care of a few other things.

1. Size - Some web sites are very large and contain many directories and files. Harvesting all of the data may take a lot of time.

2. Change Frequency - A web site may change very often, even a few times a day. Pages can be added and deleted each day. We need to decide when to revisit each site and each page within a site.

3. How do we process the HTML output? If we build a search engine we want to understand the structure of the text rather than just treat it as plain text. We must tell the difference between a caption and a simple sentence. We must look for bold or italic text, font colors, font sizes, paragraphs and tables. This means we must know HTML very well and we need to parse it first. What we need for this task is a tool called an "HTML to XML converter". One can be found on my website. You can find it in the resource box or look for it on the Noviway website: www.Noviway.com.
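
To make the third point concrete, here is a small sketch of structure-aware text processing. It does not use an HTML to XML converter. Instead it uses the HTML parser from Python's standard library, and the weights in TAG_WEIGHTS are made-up values chosen only for illustration. The idea is that a word found in a title, heading or bold text counts for more than the same word in an ordinary sentence.

```python
from html.parser import HTMLParser

# Hypothetical weights: how much a word counts when it appears inside each tag.
TAG_WEIGHTS = {"title": 5, "h1": 4, "h2": 3, "h3": 2,
               "b": 2, "strong": 2, "i": 2, "em": 2}


class WeightedTextParser(HTMLParser):
    """Records, for each word, the highest weight of any tag it appeared in."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # weighted tags currently open, innermost last
        self.weights = {}    # word -> highest weight seen so far

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the most recently opened matching tag.
            index = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[index]

    def handle_data(self, data):
        # Plain text outside any weighted tag gets weight 1.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=1)
        for word in data.lower().split():
            word = word.strip(".,;:!?\"'()")
            if word:
                self.weights[word] = max(self.weights.get(word, 0), weight)


def weigh_words(html):
    """Parse an HTML string and return a dict mapping each word to its weight."""
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()
    return parser.weights
```

A real search engine would go further, for example by skipping script and style content and by handling tables and font attributes, but the same parse-first approach applies.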

That's it for now. I hope you learned something.

Eran Aharonovich
Software Programmer
Home Page: http://www.Noviway.com
Web Crawler Page: http://www.noviway.com/Code/Web-Crawler.aspx
HTML To XML Converter Page: http://www.noviway.com/Code/HTML-To-XML.aspx

