The deep web is usually defined as the content on the Web not accessible through a search on general search engines. This content is sometimes also referred to as the hidden or invisible web.
The Web is a complex entity that contains information from a variety of source types and includes an evolving mix of different file types and media. It is much more than static, self-contained Web pages. In fact, the part of the Web that is not static, and is served dynamically "on the fly," is far larger than the static documents that many associate with the Web.
The concept of the deep Web is becoming more complex as search engines have found ways to integrate deep Web content into their central search function. This includes everything from airline flights to news to stock quotations to addresses to maps to activities on Facebook accounts. In the screenshot below, notice the various deep Web sources offered by Google, including images, maps, news, video, shopping, scholarly content, blogs, and so on. However, even a search engine as far-reaching as Google provides access to only a very small part of the deep Web.
Content on the deep Web
When we refer to the deep Web, we are usually talking about the following:
The content of databases. Databases contain information stored in tables created by such programs as Access, Oracle, SQL Server, and MySQL. (There are other types of databases, but we will focus on database tables for the sake of simplicity.) Information stored in databases is accessible only by query. In other words, the database must somehow be searched and the data retrieved and then displayed on a Web page. This is distinct from static, self-contained Web pages, which can be accessed directly. A significant amount of valuable information on the Web is generated from databases.
Non-text files such as multimedia, images, software, and documents in formats such as Portable Document Format (PDF) and Microsoft Word. For example, see Digital Image Resources on the Deep Web for a good indication of what is out there for images.
Content available on sites protected by passwords or other restrictions. Some of this is fee-based content, such as subscription content paid for by libraries or private companies and available to their users based on various authentication schemes.
Special content not presented as Web pages, such as full text articles and books
Dynamically-changing, updated content, such as news and airline flights
This is usually the basic, "traditional" list. In these days of the social Web, let's consider adding new content to our list of deep Web sources. For example:
Discussions and other communication activities on social networking sites, for example Facebook and Twitter
Bookmarks and citations stored on social bookmarking sites
As you can see, based on these few examples, the deep Web is expanding.
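The first item in the list above, database content accessible only by query, can be sketched concretely. The following is a minimal, hypothetical illustration in Python; the table, column names, and flight data are all invented. The point it demonstrates is that the "results page" is generated on the fly from a query and never exists as a static document that a link-following crawler could stumble upon.

```python
import sqlite3

# A tiny in-memory database standing in for a site's backend store.
# (Table, columns, and data are hypothetical, for illustration only.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (origin TEXT, destination TEXT, price REAL)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("JFK", "LAX", 329.0), ("JFK", "ORD", 189.0), ("BOS", "LAX", 410.0)],
)

def render_results_page(origin: str) -> str:
    """Generate an HTML results page "on the fly" from a database query.

    This page exists nowhere as a static file; it is produced only when
    a user submits a query, which is why a crawler that merely follows
    hyperlinks never sees it.
    """
    rows = conn.execute(
        "SELECT destination, price FROM flights WHERE origin = ?", (origin,)
    ).fetchall()
    items = "".join(f"<li>{dest}: ${price:.2f}</li>" for dest, price in rows)
    return f"<html><body><ul>{items}</ul></body></html>"

page = render_results_page("JFK")
print(page)
```

Until someone asks for flights out of JFK, the page listing them simply does not exist; that is the sense in which database-driven content lies below the surface.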
A search engine’s web crawler uses hyperlinks to discover and index content on the Web. This tactic is ineffective for deep web resources. For instance, crawlers do not request the dynamic pages generated by database queries, because the number of possible queries, and therefore of possible result pages, is effectively unbounded.
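The limitation just described can be sketched with a toy crawler. Everything here is invented for illustration: a small in-memory "web" of static pages, crawled breadth-first by following `href` links. The search form on the news page leads to a dynamically generated results page, but since no hyperlink points there, the crawler never discovers it.

```python
import re
from collections import deque

# A toy "web": static pages linked to one another, plus a search form
# whose results pages are generated only in response to a query.
# (All page names and content are hypothetical.)
STATIC_PAGES = {
    "/index.html": '<a href="/about.html">About</a> <a href="/news.html">News</a>',
    "/about.html": '<a href="/index.html">Home</a>',
    "/news.html": '<form action="/search">...</form>',  # a form, not a link
}

def crawl(start: str) -> set:
    """Follow hyperlinks breadth-first, the way a basic crawler does."""
    seen, queue = set(), deque([start])
    while queue:
        url = queue.popleft()
        if url in seen or url not in STATIC_PAGES:
            continue
        seen.add(url)
        # Extract href targets; anything reachable only through a form
        # submission or database query is never discovered this way.
        for link in re.findall(r'href="([^"]+)"', STATIC_PAGES[url]):
            queue.append(link)
    return seen

indexed = crawl("/index.html")
print(sorted(indexed))  # the dynamic /search results never appear
```

The crawler indexes all three static pages but nothing behind the search form, which is exactly the gap that deep web content falls into.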
These limitations are, however, being addressed by newer search engine crawlers (such as Pipl) designed to identify, interact with, and retrieve information from deep web resources and searchable databases. Google, for example, developed the Sitemap Protocol, and mod_oai pursues a similar goal; both mechanisms let web servers automatically advertise to search engines the URLs that are available on them, increasing the results from deep web searches.
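A sitemap under the Sitemap Protocol is simply an XML file listing the URLs a server wants crawled, including dynamically generated pages that no hyperlink points to. A minimal sketch, with placeholder `example.com` URLs and dates, might look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Each <url> entry advertises a page the server wants indexed.
       The URLs and dates below are placeholders. -->
  <url>
    <loc>http://www.example.com/results?flight=JFK-LAX</loc>
    <lastmod>2011-06-01</lastmod>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/archive/article.html</loc>
    <changefreq>monthly</changefreq>
  </url>
</urlset>
```

A crawler that fetches this file learns about the query-generated results page directly, without ever needing to find a link to it.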
Another solution, being developed by search engines such as Alacra, Northern Light, and CloserLookSearch, is the specialty search engine, which focuses only on particular topics or subject areas. Narrowing the scope in this way allows these engines to search the deep web in more depth, including querying password-protected and dynamic databases.
Deep Web or surface Web?
The challenge that researchers in this field face is the classification of resources: the boundary between the surface web and the deep web is a gray area. Some sites appear to be indexed by search engines but were in fact discovered not by conventional web crawlers but through OAIster, mod_oai, or the Sitemap Protocol. Other pages belong to the surface web yet have not been found by web crawlers at all.
The research being done in this field of computer science today aims to give Internet users greater access to deep web data, as well as more meaningful results for their searches. Researchers are currently looking for ways to classify and categorize search results by topic and according to users’ needs.