Escaping Special Chars in LDAP Search Filters

When programming against any LDAP backend, it’s good to sanitize any user input that may go into a search filter. A typical case is authentication (login or single sign-on) applications, where an input username or email must be used to resolve a user’s distinguished name (DN) in the LDAP directory.
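As a quick illustration of what that sanitizing involves, here is a minimal C# sketch of RFC 4515-style escaping. The EscapeLdapFilterValue helper and the sample filter are illustrative, not part of any particular library:

// Illustrative helper: escapes the filter metacharacters defined by RFC 4515
// before user input is embedded in an LDAP search filter.
static string EscapeLdapFilterValue(string value)
{
    System.Text.StringBuilder escaped = new System.Text.StringBuilder();
    foreach (char c in value)
    {
        switch (c)
        {
            case '\\': escaped.Append(@"\5c"); break;
            case '*':  escaped.Append(@"\2a"); break;
            case '(':  escaped.Append(@"\28"); break;
            case ')':  escaped.Append(@"\29"); break;
            case '\0': escaped.Append(@"\00"); break;
            default:   escaped.Append(c);      break;
        }
    }
    return escaped.ToString();
}

// Usage: resolve a DN from an email address without letting the input
// inject extra filter clauses.
// string filter = "(&(objectClass=user)(mail=" + EscapeLdapFilterValue(email) + "))";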

How to search between two "Like" fields in SQL Server 2012

Sometimes you will be asked to filter information, and what better way of doing this than using “Like” filters in SQL Server?
Flexible and easy to use, the two wildcard characters can serve as placeholders in a pattern: the percent sign (%) matches zero or more characters, while the underscore (_) matches exactly one character.
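To make the two wildcards concrete, here is a minimal C# sketch of a parameterized query combining two LIKE conditions. The Customers table, the LastName column, and the connectionString variable are illustrative assumptions, not from the article:

// '%' matches zero or more characters; '_' matches exactly one character.
// Both patterns are passed as parameters, so no quoting gymnastics are needed.
using (System.Data.SqlClient.SqlConnection connection =
           new System.Data.SqlClient.SqlConnection(connectionString))
using (System.Data.SqlClient.SqlCommand command =
           new System.Data.SqlClient.SqlCommand(
               "SELECT LastName FROM Customers " +
               "WHERE LastName LIKE @prefix AND LastName LIKE @pattern",
               connection))
{
    command.Parameters.AddWithValue("@prefix", "Sm%");    // Smith, Smart, Smyth...
    command.Parameters.AddWithValue("@pattern", "Sm_th"); // Smith, Smyth
    connection.Open();
    using (System.Data.SqlClient.SqlDataReader reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            System.Console.WriteLine(reader.GetString(0));
        }
    }
}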

How to search a website using Microsoft Indexing services


Providing search capabilities requires two steps:

  • First, you must create an index of the site’s contents. An index of a website is analogous to the index in a book: if you want to read about a particular topic, you can quickly find the relevant page(s) through the index, as opposed to having to read through the book’s entire contents.
  • Once an index has been created, you need to be able to search through it. Essentially, a user will enter a search string and you’ll need to find the matching, relevant results in the index and display them to the user.

Unfortunately, building a search engine for your site is not as straightforward and simple as we’d like. Writing your own indexer, indexing the content, and building code to search the index is certainly possible, but requires a good deal of work. Fortunately, a number of existing indexers can be leveraged for your site. These include commercial products like EasySearchASP.NET and Building31.Search, which are designed specifically to search ASP.NET websites. Additionally, Microsoft provides its own indexer, Microsoft Indexing Services.

This article examines using Microsoft Indexing Services for your site’s search functionality.
With Indexing Services you can designate a specific group of documents or HTML pages to be indexed, and then create an ASP.NET page that can query this index.

We’ll build a simple, fast, and extensible search tool using .NET and Microsoft Indexing Services along with the Adobe IFilter plug-in, which allows MS Indexing Services to index PDF documents and display them in your search results.

Configuring Microsoft Indexing Services

The first step in creating an index for your search application is to configure Indexing Services on the IIS server that your Web application will be running on. To do this you need access to the Web server itself. Open the Microsoft Management Console by clicking Start, then Run; type mmc and click OK. Next, to open the Indexing Service snap-in, you must:

  • Click File,
  • Click Add/Remove Snap-In,
  • Click Add,
  • Select the Indexing Service Snap-In,
  • Click Add,
  • Click Finish,
  • Close the dialog.

After following these steps you should see something akin to the screenshot below.

To create a new catalog (the term Microsoft uses for an index), right-click on the Indexing Service node, click New, and then Catalog. You then need to choose a location to store the catalog file. Once you’ve done that, expand the catalog that you just created and click on the Directories icon. Right-click on the Directories folder, click New, then Directory, and add the directory or directories that contain the content that you want to search. These directories can reside anywhere the host computer can access; virtual directories and even UNC paths (\\Server\Share) may be used. However, each directory that is indexed must either reside physically, or be included as a virtual directory, in the root of the website that you are indexing. If an indexed directory is not in the web root via a physical folder or virtual directory, its files will still appear in your search results, but their links will be broken.


Indexing Services will effectively index HTML, Word, and, once properly configured, PDF documents. To ensure that your required directories will be indexed, verify that the index flag is properly set on the files and folders. You can check this setting by right-clicking on any folder or file and selecting Properties. Click the “Advanced” button and make sure that the “For fast searching, allow Indexing Service to index this folder” checkbox is checked, as shown in the screenshot to the right.
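If you have many folders to audit, the same checkbox can be flipped programmatically, since it maps to the NotContentIndexed file attribute. A small sketch (the path is illustrative):

// The "allow indexing" checkbox maps to the NotContentIndexed attribute:
// when the attribute is SET, the folder is EXCLUDED from indexing.
string path = @"C:\Inetpub\wwwroot\docs";  // illustrative folder

System.IO.FileAttributes attributes = System.IO.File.GetAttributes(path);

if ((attributes & System.IO.FileAttributes.NotContentIndexed) != 0)
{
    // Clear the attribute so Indexing Services will pick the folder up.
    System.IO.File.SetAttributes(
        path, attributes & ~System.IO.FileAttributes.NotContentIndexed);
}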

Next, you want to set the properties of this catalog so that the HTML paths can be used, and so that Indexing Services will generate abstracts for the documents as they are indexed. To do this, right-click on the catalog you just created and select Properties. On the Tracking tab, make sure that the “WWW Server:” field is set to the website that your application will be running from. This ensures that the HTML paths work as they should when you get to building the
front-end for the search. If you want to display a short excerpt of each article along with your search results, go to the Generation tab, uncheck “Inherit above settings from service,” then check “Generate abstracts” and set the number of characters you wish to have displayed in each abstract.

If you want your search to include PDF documents, you must install the Adobe IFilter extension, which can be downloaded free of charge from Adobe: http://www.adobe.com/support/downloads/detail.jsp?ftpID=2611. The plug-in is a standard Windows installer and requires no additional configuration: once it is installed, PDF documents are indexed and included in the search results automatically.

When you navigate to the Directories folder in the catalog that you’ve created, you may notice that one or more directories appear in addition to the ones you added in the previous step. These are website shares added automatically by Indexing Services, and they need to be excluded from indexing if you don’t want your search to include them. To exclude these directories, find them in the file system via Windows Explorer, then right-click each folder and choose Properties.

From the dialog that appears, click Advanced and uncheck the box that says “For fast searching, allow Indexing Service to index this folder” (see the screenshot above). This will exclude the folder from your search results. The configuration of Indexing Services is now complete. As you can see, an index may include as little as one folder of documents or as much as an entire website or group of websites; it’s up to you to determine the breadth of the index. However, since Indexing Services does not crawl links like a spider, it will only catalog file system objects. Thus, the results from this search will include static files such as HTML pages, Word documents, and PDF documents, but not any dynamically generated pages. Changes made to these static documents will be picked up by Indexing Services and will very quickly be reflected in your search results.

Searching the Index

Once the index has been created, the next step is to build a search page that allows the website visitor to search through the index. To do this, you need, at minimum, a TextBox Web control for the end user to enter search terms, a Button Web control to initiate the search, and a Repeater control to display the results. The following shows the bare minimum markup for a simple search page:

Enter your search query:
<asp:TextBox id="txtSearch" runat="server"/>

<asp:Button
      id="btnSearch"
      runat="server"
      Text="Search"
      OnCommand="btnSearch_Click"
      CommandName="search"
/>

<hr>

<asp:Repeater id="searchResults" runat="server">
    <HeaderTemplate></HeaderTemplate>

    <ItemTemplate>
        <a href='<%# DataBinder.Eval(Container.DataItem, "vpath") %>'>
            <%# DataBinder.Eval(Container.DataItem, "doctitle") %></a> <br>
        <%# DataBinder.Eval(Container.DataItem, "characterization") %> <br>

    </ItemTemplate>

    <SeparatorTemplate><br></SeparatorTemplate>
</asp:Repeater>


This page displays each result as a line with the document title, followed by the abstract that Indexing Services generated. Let’s take a look at the code-behind class.

In the code-behind page, an OleDbConnection is attached to the Indexing Services catalog that we set up earlier. Once connected, the catalog can be searched using a variety of query languages, including SQL syntax. You can read about each of the language options here: Query Languages for Indexing Services. For this example, I’m going to use the IS Query Language to perform a freetext search, which allows for natural language search capabilities, but you can modify your search to use Boolean, phrase, or any of the other query types that Indexing Services supports.


To set up the connection to the Indexing Services catalog, you need to create an OleDb connection as follows:


// Create a connection object and a command object to connect to the Index
// Server. "docSearch" is the name of the catalog created earlier.

System.Data.OleDb.OleDbConnection odbSearch = new System.Data.OleDb.OleDbConnection("Provider=MSIDXS;Data Source=docSearch;");

System.Data.OleDb.OleDbCommand cmdSearch = new System.Data.OleDb.OleDbCommand();

// Assign the connection to the command object.

cmdSearch.Connection = odbSearch;

// Escape any single quotes in the user's input so they can't break out of
// the quoted FREETEXT argument below.

string searchText = txtSearch.Text.Replace("'", "''");

// Query for a free-text string in the contents of the indexed documents in
// the catalog, sorted by relevance ranking.

cmdSearch.CommandText = "select doctitle, filename, vpath, rank, characterization from scope() where FREETEXT(Contents, '" + searchText + "') order by rank desc";
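
To complete the picture, here is a minimal sketch of executing this command and binding the rows to the Repeater shown earlier; error handling and empty-result checks are omitted for brevity:

// Fill a DataTable with the search results and bind it to the Repeater.
System.Data.DataTable results = new System.Data.DataTable();

using (System.Data.OleDb.OleDbDataAdapter adapter =
           new System.Data.OleDb.OleDbDataAdapter(cmdSearch))
{
    // Fill() opens and closes the connection on our behalf.
    adapter.Fill(results);
}

searchResults.DataSource = results;
searchResults.DataBind();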

The fields returned from querying the index include:

  • Doctitle: The title of the document, which is the text between the <title> tags in an HTML document or
    the text in the title field of a Word or PDF document.
  • Filename: The physical name of the file that the result was returned from.
  • Vpath: The virtual path of the file that the result was returned from. This is the field you use to build an HTML
    link to the file.
  • Rank: The relevance of the returned result.
  • Characterization: The abstract for the document, usually the first 320 characters.

Tying it All Together

While there are a number of ways to display the results from your search, a Repeater is likely the most efficient and gives you the greatest control over how your results are formatted. The sample application that is attached demonstrates how to bind the results of your search to a Repeater control. It also adds paging functionality that makes the results easier to navigate, as shown in the screenshot below.
The results can easily be modified to show paths to documents or to display the ranking of each result.
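
As one possible approach (the attached sample may do this differently), paging can be layered on with the built-in PagedDataSource class; the results table and the currentPage variable are assumed from the earlier sketch:

// Wrap the results in a PagedDataSource so the Repeater shows one page at a time.
System.Web.UI.WebControls.PagedDataSource pagedResults =
    new System.Web.UI.WebControls.PagedDataSource();

pagedResults.DataSource = results.DefaultView;  // the DataTable filled earlier
pagedResults.AllowPaging = true;
pagedResults.PageSize = 10;                     // results per page
pagedResults.CurrentPageIndex = currentPage;    // e.g. tracked in ViewState

searchResults.DataSource = pagedResults;
searchResults.DataBind();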

Conclusion

This search tool is small, fast, and simple enough to deploy for searching individual folders of specific content on your intranet or Internet site, yet it can easily be used to search entire sites composed of thousands of documents. Thanks to the power of Microsoft Indexing Services, all you need to do is alter the scope of the Indexing Services catalog to include any folders you want indexed. New documents added to those folders, and modifications to existing ones, will be picked up by Indexing Services automatically.

For more information about Indexing Services, be sure to read the following resources:

Introduction to MS Index Server

Google Deleted 100 Million Search Results in 2013

Since the beginning of the current year, rights owners have asked the search giant to remove over 100 million links to “pirate” websites. This figure is already double the number Google processed in all of last year. Google is currently processing an average of 15 million “pirate” links per month, and although this number is leveling off, the rights owners aren’t satisfied yet.

Trying to steer prospective customers away from illegal websites, rights owners keep sending the search engine millions of DMCA takedown requests. Google, for its part, is trying to give the public insight into the scope and nature of this process, which is why it started publishing details of all takedown requests in its Transparency Report. Since last year, the number of URLs the company is being asked to remove has exploded.

Thus far, Google has been asked to delete more than 105,300,000 links to infringing websites, and most of them no longer appear in search results.


As for the websites for which the company received the most takedown notices, the file-hosting search engine FilesTube tops the rankings with almost 6,000,000 URLs. Another “rogue” website is Torrentz.eu with over 2,500,000 URLs, followed by Rapidgator.net with more than 2,000,000 links. Surprisingly, the infamous Pirate Bay didn’t show up in the top 20. Maybe this is because it changed domain names, or maybe because it hosts just 2,000,000 magnet links on the website.

As for the reporting groups, the Recording Industry Association of America is one of the most active senders of DMCA takedown requests: the anti-piracy outfit has sent takedown requests for over 26 million URLs within the last year and a half. Although Google responds swiftly, the entertainment industry doesn’t believe the takedowns are effective, which is why it is now asking the search giant to ban entire domains from its search results.

For its part, the company is satisfied with the way things are going, saying that it has faith in the general workings of the DMCA takedown procedure. The only problem with the massive number of takedowns is that thousands of links are taken down in error; for example, Microsoft recently asked to have its very own website removed from the search results.

In the meantime, industry experts note that it will be interesting to see how the tension between the search engine and the rights owners develops over time.

Google could not fulfill its promise to reduce its dependence on search revenue

The world-famous search engine Google had a 5-year plan intended to reduce its dependence on search revenue to 65% by next year. The plan, apparently, didn’t work out.

According to local media reports, this figure appeared in the paperwork related to the Google vs. Oracle trial and showed that Google was in fact off its targets by many miles. The plan, drawn up back in 2010, saw the company receiving over 35% of its 2013 revenue from outside its search operation.

It seems that Internet commerce and an initiative to bring Google services to TV were on the list of things Google was going to shift to. Experts who saw the paper most likely laughed, assuming that the Google TV and commerce ambitions never really happened. According to Herman Leung, an analyst at Susquehanna Financial Group, two years ago Google was a little more aggressive than it is now.

Actually, the projections for the company’s different businesses were part of a presentation to Google’s board of directors two years ago. The search giant tried to convince US District Judge William Alsup to keep the papers secret, saying they were commercially sensitive. To the judge, however, making the company look silly over failing to meet its goals doesn’t count as “commercially sensitive.” Jim Prosser, a spokesman for the company, claimed that the papers didn’t represent current thinking about its business operations, but he forgot to say why it was so important to have this data suppressed.

The paper also reveals how Google saw an emerging threat from cooperation between the largest social network in the world, Facebook, and Microsoft’s Bing search engine: Google was worried that Facebook-Bing users might bypass it. Meanwhile, its own YouTube business was estimated to generate $5 billion by 2013, including a $3 billion contribution from its own TV project, which actually never happened.