Discovering The Keyword Matrix

The secret of writing great online copy is simple: use keywords – the words people use when they search – in your copy.
The reasons for this are twofold:
1. If you use the words people use when they search in your copy, page titles, descriptions and links, then you’ll score well in search engine results and more people will find your site.
2. The language you use will be appropriate to your audience. And as any seasoned direct marketer will tell you, speak to people in their own language and they will be more likely to buy from you.
That, of course, raises the question: “How do you find your best keywords?”
Most people either just guess or do perfunctory research using one of the many free keyword tools that are found online.
But if you want to succeed online, you’ve got to spend enough time and energy discovering the words your customers really use when they search.

You’ll need not just 20-30 keywords, you’ll need hundreds – and many of the most successful online companies will have thousands of keywords.

The keyword matrix is a simple methodology that allows you to build a comprehensive keyword list using the Wordtracker database.
Here are the 3 steps involved:
Step 1 – Start with a common word or phrase that is appropriate to your business
Start simple. Your starting point does not need to be clever or creative. All you need is a common word or phrase relevant to your business.
So if I were researching the market for chocolate, a good starting point would be the word ‘chocolate’.
Excellent. So far, so good. Now all you have to do is rush off and optimize your web pages for those phrases, right?

Wrong! That is what most people will do and as a result they will miss out on some very powerful keywords which aren’t immediately obvious.
Time to move on to step 2 and get inside the keyword matrix.

Step 2 – Find words that are related to your original keyword
Here the task is not to find relevant keywords that contain the word ‘chocolate’: in fact, it is just the opposite – to find relevant keywords that do NOT contain the word ‘chocolate’ like truffles, cocoa, fudge, confectionery, gifts, gourmet and so on.
Scan the list of related keywords and pick the ones that YOU feel are relevant to your business.

Step 3 – Use the related words to generate many more relevant keywords
Essentially, this repeats step 1 for each of the related keywords that you have chosen. Take each of your chosen related words and enter them into the Keyword Researcher tool in turn.
By following this methodology, you can quickly generate hundreds of relevant keywords that reflect the subtleties – and niche markets – within your marketplace.

Final words
The average person will do their keyword research once and then forget about it. But that is the way to get average results.
Effective keyword research is an ongoing process and to get outstanding results you must work at it regularly.
To succeed in keyword research you should:

1. Regularly check your keyword counts on Wordtracker – they can go up and down over time
2. Continue to add more keywords to your matrix. The more effective keywords you have, the more profitable your online business will be.
3. Monitor how well your keywords do. Performance will always be a mixture of:
• Keywords that bring good traffic and good conversions (these are the words people use when they are in buying mode)
• Keywords that bring you good traffic but poor conversions (these are the words people use when they are in research mode)
• Keywords that bring low traffic but great conversions (these are the words that represent buying behaviour in niche markets).
Overall, there will be a mix of such keywords in any comprehensive keyword matrix.
In the next lesson, we’ll look at how you can pick the most competitive keywords from the sample matrix that we have developed for ‘chocolate’ and how you can use those keywords in your website copy.


Inside the guts of a search engine


For simplicity’s sake, let’s say there are three pieces of software that together make up a search engine – the Spider software, the Index software and the Query software.
If you understand what these three do, then you have the foundation for getting your website to the top of the search engines.
Here’s what the three types of software do:
The Spider software ‘crawls the web looking for new pages to collect and add to the search engine indices’.
This is a metaphor. In reality, the spider doesn’t do any ‘crawling’ and doesn’t ‘visit’ any web pages. It requests pages from a website in the same way that Internet Explorer, Firefox or whatever browser you use requests pages to display on your screen.
The difference is that the spider doesn’t collect images or fancy designs – it is only interested in text and links AND the URL (the Uniform Resource Locator) from which they come: it doesn’t display anything and it gets as much information as it can in the shortest time possible.
A spider loves links because they lead it to other web pages that have the things it loves – guess what? – more text, links and URLs!
The Index software catches everything the Spider can throw at it (yes, that’s another metaphor). The index makes sense of the mass of text, links and URLs using what is called an algorithm – a complex mathematical formula that indexes the words, the pairs of words and so on.

Essentially, an algorithm analyses the pages and links for word combinations and assigns scores that allow the search engine to judge how important the page (and URL) might be to the person that is searching. And of course it stores all of this information and makes it available to people who are searching.

The Query software is what you see when you use a search engine – it is the front end that everybody thinks of as a search engine. It may look simple but it presents the results of all the remarkable search engine software that works away invisibly on our behalf.
The main feature of the query software is the box into which people type their search terms.
Type in your words, hit search and the search engine will try to match your words with the best web pages it can find through searching the web.
But this too is a metaphor and perhaps the most important one.
The query software doesn’t search the web – it checks the records that have been created by its own index software. And those records have been made possible by the raw material the spider software collects.

What you need to understand about search engines
What you need to understand is that the search engine has done all the hard work of collecting and analysing web pages, BUT it only makes that information available when someone does a search by entering words in the search box and hitting return.
The words people use when they search therefore determine the results the search engine presents. We call them keywords – that might sound fancy but keywords are only ‘the words people use when they search’.
Use keywords in your website copy and you will prosper: ignore them and your online business will surely perish.

How to search a website using Microsoft Indexing Services


Providing search capabilities requires two steps:

  • First, you must create an index of the site’s contents. An index of a website is analogous to the index in a book. If you want to read about a particular topic in the book you can quickly find the page(s) through the index, as opposed to having to read through the book’s entire contents.
  • Once an index has been created you need to be able to search through the index. Essentially, a user will enter a search string and you’ll need to find the matching, relevant results in the index, displaying them to the user.

Unfortunately, building a search engine for your site is not as straightforward and simple as we’d like. Writing your own indexer, indexing the content, and building code to search the index is definitely possible, but it requires a good deal of work. Fortunately, there are a number of existing indexers that you can leverage for your site. These include commercial products like EasySearchASP.NET and Building31.Search, which are designed specifically to search an ASP.NET website. Additionally, Microsoft provides its own indexer, Microsoft Indexing Services.

This article examines using Microsoft Indexing Services for your site’s search functionality.
With Indexing Services you can specify a specific group of documents or HTML pages to be indexed, and then create an ASP.NET page that can query this index.

We’ll build a simple, fast, and extensible search tool using .NET and Microsoft Indexing Services along with the Adobe IFilter plug-in, which allows MS Indexing Services to index PDF documents and display them in your search results.

Configuring Microsoft Indexing Services

The first step in creating an index for your search application is to configure Indexing Services on the IIS server that your Web application will be running on. To do this you need access to the Web server itself. Open the Microsoft Management Console by clicking Start, then Run; type mmc and click OK. Next, to open the Indexing Services snap-in, you must:

  • Click File,
  • Click Add/Remove Snap-In,
  • Click Add,
  • Select the Indexing Service Snap-In,
  • Click Add,
  • Click Finish,
  • Close the dialog

After following these steps you should see something akin to the screenshot below.

To create a new catalog – which is the vernacular Microsoft uses for an index – right-click on the Indexing Service node, click New and then Catalog. You then need to choose a location to store the catalog file. Once you’ve done that, expand the catalog that you just created and click on the Directories icon. Right-click on the Directories folder, click New, then Directory, and add the directory or directories that contain the content that you want to search. These directories can reside anywhere that the host computer can access; virtual directories and even UNC paths (\\Server\Share) may be used. However, each directory that is indexed must either reside physically, or be included as a virtual directory, in the root of the website that you are indexing. If a directory outside the web root is specified without a corresponding physical folder or virtual directory, its documents will still appear in your search results, but the links to them will be broken.


Indexing Services will effectively index HTML, Word, and, once properly configured, PDF documents. To ensure that your required directories will be indexed, you should verify that the index flag is properly set on the files and folders. You can verify this setting by right-clicking on any folder or file and selecting Properties. Click the “Advanced” button and make sure that the “For fast searching, allow indexing services to index this folder” checkbox is checked, as shown in the screenshot to the right.

Next, you want to set the properties of this catalog so that the HTML paths can be used, and so that Indexing Services will generate abstracts for the documents as they are indexed. To do this, right-click on the catalog you just created and select Properties. On the Tracking tab, make sure that the “WWW Server:” field is set to the website that your application will be running from. This ensures that the HTML paths work as they should when you get to building the front-end for the search. If you want to display a short excerpt of each document along with your search results, go to the Generation tab, uncheck “Inherit above settings from service,” then check “Generate abstracts” and set the number of characters you wish to have displayed in each abstract.

If you want your search to include PDF documents, then you must install the Adobe IFilter extension. You can download this free of charge from Adobe: http://www.adobe.com/support/downloads/detail.jsp?ftpID=2611. This plug-in is a standard Windows installer and requires no additional configuration. After the plug-in has been installed, PDF documents will automatically be included in the search results as they are indexed, without any user intervention or configuration required.

When you navigate to the Directories folder in the catalog that you’ve created, you may notice that one or more directories appear in addition to the ones you added in the previous step. These are website shares added automatically by Indexing Services, and they need to be excluded from indexing if you don’t want your search to include them. To exclude these directories, find them in the file system via Windows Explorer, then right-click the folder and choose Properties.

From the dialog that appears, click Advanced and uncheck the box that says “For fast searching, allow indexing services to index this folder” (see the screenshot above). This will exclude the folder from your search results. The configuration of Indexing Services is now complete. As you can see, an index may include as little as one folder of documents or as much as an entire website or group of websites. It’s up to you to determine the breadth of the index. However, since Indexing Services does not crawl links like a spider, it will only catalog file system objects. Thus, the results from this search will include static files such as HTML pages, Word documents, and PDF documents, but not any dynamically generated pages. Changes made to these static documents will be picked up by Indexing Services and will very quickly be reflected in your search results.

Searching the Index

Once the index has been created, the next step is to build a search page that allows the website visitor to search through the index. To do this, you need, at minimum, a TextBox Web control for the end user to enter search terms, a Button Web control to initiate the search, and a Repeater control to display the results. The following shows the bare minimum markup for a simple search page:

Enter your search query:
<asp:TextBox id="txtSearch" runat="server"/>

<asp:Button
      id="btnSearch"
      runat="server"
      Text="Search"
      OnCommand="btnSearch_Click"
      CommandName="search"
/>

<hr>

<asp:Repeater id="searchResults" runat="server">
    <HeaderTemplate></HeaderTemplate>

    <ItemTemplate>
        <%# DataBinder.Eval(Container.DataItem, "File") %> <br>
        <%# DataBinder.Eval(Container.DataItem, "FileAbstract") %> <br>

    </ItemTemplate>

    <SeparatorTemplate><br></SeparatorTemplate>
</asp:Repeater>


This results page will display a list of results with a line for the document title followed by the abstract, which is generated by indexing services. Let’s take a look at the code-behind class.

In the code-behind page, an OleDbConnection is attached to the Indexing Services catalog that we set up earlier. Once connected, the catalog can be searched using a variety of query languages, including SQL syntax. You can read about each of the language options here: Query Languages for Indexing Services. For this example, I’m going to use the Indexing Services query language to perform a freetext search, which allows for natural-language search capabilities, but you can modify your search to use Boolean, phrase, or any of the query types that Indexing Services supports.
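
As a rough illustration of the difference, here is the same kind of search expressed first as a freetext query and then as a Boolean/prefix query with CONTAINS. The column names and scope() match the query built later in this article; the search terms themselves are just placeholders, so treat this as a sketch of the syntax rather than a drop-in replacement:

-- Free-text (natural language) search, ranked by relevance
SELECT doctitle, vpath, rank FROM scope()
WHERE FREETEXT(Contents, 'indexing services search')
ORDER BY rank DESC

-- Boolean / prefix search using CONTAINS instead
SELECT doctitle, vpath, rank FROM scope()
WHERE CONTAINS(Contents, '"index*" AND "search"')
ORDER BY rank DESC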


To set up the connection to the Indexing Services catalog you need to set up an OleDb connection as follows:


// create a connection object and command object to connect to the Index Server catalog

System.Data.OleDb.OleDbConnection odbSearch = new System.Data.OleDb.OleDbConnection("Provider=MSIDXS;Data Source=docSearch;");

System.Data.OleDb.OleDbCommand cmdSearch = new System.Data.OleDb.OleDbCommand();

// assign connection to command object cmdSearch

cmdSearch.Connection = odbSearch;

// escape single quotes in the user's input so they don't break the query string

string searchText = txtSearch.Text.Replace("'", "''");

// query to search a free text string in the contents of the indexed documents in the catalog,
// sorted by relevance ranking

cmdSearch.CommandText = "select doctitle, filename, vpath, rank, characterization from scope() where FREETEXT(Contents, '" + searchText + "') order by rank desc";

The fields returned from querying the index include:

  • Doctitle: The title of the document, which is the text between the <title> tags in an HTML document, or the text in the Title field of a Word or PDF document.
  • Filename: The physical name of the file that the result was returned from.
  • Vpath: The virtual path of the file where the result was returned from. This is the field you use to build an HTML link to the file.
  • Rank: The relevance of the returned result.
  • Characterization: The abstract for the document, usually the first 320 characters.

 

Tying it All Together

While there are a number of ways in which you can display the results from your search, a Repeater is likely the most efficient and gives you the greatest control over how your results are formatted. The sample application that is attached demonstrates how to bind the results of your search to a Repeater control. It also adds paging functionality that makes the results easier to use, as shown in the screenshot below.
The results can easily be modified to show paths to documents or display the rankings of each result.

Conclusion

This search tool is small, fast, and simple enough to deploy for searching individual folders of specific content in your intranet or Internet site. However, it can easily be used to search entire sites composed of thousands of documents. Due to the power of Microsoft Indexing Services, all that you will need to do is alter the scope of the Indexing Services catalog to include any folders you want indexed. Adding new documents to these folders or modifying the existing documents will automatically be picked up by Indexing Services.

For more information about Indexing Services, be sure to read the following resources:

Introduction to MS Index Server

How to add a multi-page TIFF to a PDF using iTextSharp (VB.NET)

If you have ever been stuck with a multi-page TIFF, you will know how frustrating it is to get the images out and then add them to a PDF. The code snippet below shows how it’s done – feel free to leave a comment if you have used it and it worked.

'check if image exists
Dim fi As New FileInfo(strImagePath)
If fi.Exists Then

    'save image in document ('Image' here is iTextSharp.text.Image)
    Dim gif As Image = Image.GetInstance(fi.FullName)
    Dim pageWidth = doc.PageSize.Width - (10 + 10)
    Dim pageHeight = doc.PageSize.Height - (40 + 10)

    If fi.Extension.ToLower.EndsWith("tif") Then
        'load the TIFF with GDI+ so we can walk its frames (pages)
        Dim bmp As New System.Drawing.Bitmap(fi.FullName)
        Dim total As Integer = bmp.GetFrameCount(System.Drawing.Imaging.FrameDimension.Page)
        If total > 1 Then
            'add each frame of the multi-page TIFF on its own PDF page
            For k As Integer = 0 To total - 1
                doc.NewPage()
                bmp.SelectActiveFrame(System.Drawing.Imaging.FrameDimension.Page, k)
                Dim img As iTextSharp.text.Image
                img = iTextSharp.text.Image.GetInstance(bmp, System.Drawing.Imaging.ImageFormat.Bmp)
                img.SetAbsolutePosition(10, 40)
                img.ScaleToFit(pageWidth, pageHeight)
                doc.Add(img)
            Next
        Else
            'single-frame TIFF: add it like any other image
            doc.NewPage()
            gif.SetAbsolutePosition(10, 40)
            gif.ScaleToFit(pageWidth, pageHeight)
            doc.Add(gif)
        End If
    Else
        'not a TIFF: add the image directly
        doc.NewPage()
        gif.SetAbsolutePosition(10, 40)
        gif.ScaleToFit(pageWidth, pageHeight)
        doc.Add(gif)
    End If
End If

SQL Procedures and Functions that any DBA should know

Have you ever needed a SQL script in a hurry and not wanted to waste any time looking for it on Google? We have put together a list of the most-used SQL queries and commands for your database performance work.

For example purposes, let’s assume that you have a table called Company with a primary key on CompanyID (identity column, int).

The rest of the data in the table will be mentioned when the time comes.

Primary Keys

In Table Creation


CREATE TABLE Company
(
CompanyId int NOT NULL,
CompanyName varchar(255) NOT NULL,
Address varchar(255),
City varchar(255),
CONSTRAINT pk_Company PRIMARY KEY (CompanyId)
)

After Table Creation


ALTER TABLE Company ADD PRIMARY KEY (CompanyId)

or:

ALTER TABLE Company ADD CONSTRAINT pk_Company PRIMARY KEY (CompanyId)

Drop Key

ALTER TABLE Company
DROP CONSTRAINT pk_Company


 

Indexes

Create Index

CREATE INDEX IX_Company_City ON Company (City)

Create Unique Index

CREATE UNIQUE INDEX IX_Company_City ON Company (City)

Among the resources that shape indexes, the most elemental ones at your disposal are the statements that create the index and the two system stored procedures, sp_helpindex and sp_help @tablename.

As an example, I asked the Object Browser in Query Analyzer to script a table as a CREATE statement.

I then invoked sp_helpindex to get the details of the indexes on this table, Company.
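
If you want to try the same thing against the example table, the calls are simply:

EXEC sp_helpindex 'Company'
EXEC sp_help 'Company'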

Output

The index named PK_Company was created as the result of a constraint creating a primary key.

PK_Company is a clustered index, meaning that the table is in the order specified by the index key – in this case, the identity column CompanyID.
The index created has only one column, and since it was a primary key constraint, it also is unique and not null.
Because I didn’t specify the direction of the index, it was created in the default of ascending order.
Since the clustered index was created on the Primary file group, the table and its index were created there. Clustered indexes are always created on the same file group as the data, because the leaf level or lowest level of a clustered index is the data itself. (In fact, you can move a table to a different file group by creating a clustered index for the table on a different file group. Both structures will move. Additionally, clustered indexes are effective for range searches because of their proximity to the data and because they sort the table in index order.)

The Unintended Consequences of Clustered Indexes


Be sure to analyze the retrieval requests for your system and design your indexes to accommodate most of those requests. You’ll have to accept tradeoffs between the most frequently executed queries and the less frequently executed but more demanding queries.

The optimizer can choose to use an existing index based on its statistics. However, if the statistics indicate that reading the whole table would be cheaper, the optimizer will perform a table scan. The database can maintain index statistics with the Auto_Create_Statistics option set, or you can update them manually. Depending on how busy your server is, you may elect to perform updates at less busy times. As a side benefit, you can also keep statistics on non-indexed columns, which could help query execution.
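As a small illustration against the example Company table (the Address column statistic and its name are made up for the sketch):

-- Refresh the statistics the optimizer holds for the Company table
UPDATE STATISTICS Company

-- Keep statistics on a non-indexed column to help the optimizer
CREATE STATISTICS st_Company_Address ON Company (Address)
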

Clustered and non-clustered indexes are interrelated by virtue of the clustering index key. You can promote efficiency in index maintenance by specifying clustered indexes first, because the non-clustered indexes use the clustered index’s pointers. If you specify a non-clustered index before a clustered index and then drop the clustered one, you are forcing the server to do double the work. When you specify a clustered index, you create the clustering key that succeeding secondary or non-clustered indexes use. If the clustered index comes after the non-clustered index, the server will drop and recreate the non-clustered indexes so it can inform them about where to find pages with the clustering key. To avoid this extra work, use the DROP_EXISTING clause of the CREATE INDEX statement.

Another consequence of clustered indexes is that you can influence the physical location of the data by using the ON FILEGROUP clause. If you designate a different file group for the index, you can move the table to the named file group in the clustered index order. Because the clustered index in effect is the data, one less intermediate leaf level lies between the data and the index. You’ll have to invoke the ALTER DATABASE command to create the new file group, then relocate your table to the new file group with a CREATE INDEX statement’s ON FILEGROUP option.

If you elect this Drop Existing option after the fact (i.e., your table already exists and you have already created the new file group), you can use the CREATE INDEX command with the DROP_EXISTING clause to prevent the double build and rebuild of the non-clustered indexes.
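
Sketched against the example Company table – the SECONDARY filegroup name here is purely illustrative, and the filegroup and its files must already exist – the rebuild-and-move might look something like this:

-- Rebuild the clustered primary key index on another filegroup;
-- the table's data pages move with it (SECONDARY is an illustrative name)
CREATE UNIQUE CLUSTERED INDEX pk_Company
ON Company (CompanyId)
WITH DROP_EXISTING
ON [SECONDARY]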


DBCC DROPCLEANBUFFERS

DBCC DROPCLEANBUFFERS


Use this command to remove all the data from SQL Server’s data cache (buffer) between performance tests to ensure fair testing. Keep in mind that this command only removes clean buffers, not dirty buffers. Because of this, you may want to run the CHECKPOINT command first. Running CHECKPOINT will write all dirty buffers to disk, and then when you run DBCC DROPCLEANBUFFERS you can be assured that all data buffers are cleaned out, not just the clean ones.
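
In practice, the two commands are simply run back to back:

-- Write all dirty pages to disk, then empty the data cache
CHECKPOINT
GO
DBCC DROPCLEANBUFFERS
GO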

DBCC ERRORLOG

DBCC ERRORLOG


If you rarely restart the mssqlserver service, you may find that your server log gets very large and takes a long time to load and view. You can truncate the current server log (essentially creating a new log) by running DBCC ERRORLOG. You might want to consider scheduling a regular job that runs this command once a week to automatically truncate the server log. As a rule, I do this for all of my SQL Servers on a weekly basis. You can also accomplish the same thing using the stored procedure sp_cycle_errorlog.
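
The stored procedure version, which you could drop straight into a weekly SQL Server Agent job step, is just:

-- Close the current error log and start a new one
EXEC sp_cycle_errorlog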

Procedure Caching

DECLARE @intDBID INTEGER

SET @intDBID = (SELECT dbid FROM master.dbo.sysdatabases WHERE name = 'database_name')

DBCC FLUSHPROCINDB (@intDBID)


DBCC FLUSHPROCINDB: Used to clear out the stored procedure cache for a specific database on a SQL Server, not the entire server. The database ID number to be affected must be entered as part of the command.

You may want to use this command before testing to ensure that previous stored procedure plans won’t negatively affect testing results.
If you want to clear the cache for all databases, use: DBCC FREEPROCCACHE
If you are unsure how the procedure cache is being used, run this command: DBCC PROCCACHE
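
For reference, those two commands look like this when run on their own:

-- Clear cached plans for every database on the server
DBCC FREEPROCCACHE

-- Report how the procedure cache is currently being used
DBCC PROCCACHE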

DBCC INDEXDEFRAG

DBCC INDEXDEFRAG (Database_Name, Table_Name, Index_Name)


In SQL Server 2000, Microsoft introduced DBCC INDEXDEFRAG to help reduce logical index fragmentation. When this command runs, it reduces fragmentation without locking tables, allowing users to access the table while the defragmentation process is running. Unfortunately, this command doesn’t do a great job of logical defragmentation.

The only way to truly reduce logical fragmentation is to rebuild your table’s indexes. While this will remove all fragmentation, unfortunately it will lock the table, preventing users from accessing it during this process. This means that you will need to find a time when this will not present a problem to your users.

Of course, if you are unable to find a time to reindex your indexes, then running DBCC INDEXDEFRAG is better than doing nothing.

DBCC DBREINDEX

DBCC DBREINDEX('table_name', '', fillfactor)

Periodically (weekly or monthly) perform a database reorganization on all the indexes on all the tables in your database. This will rebuild the indexes so that the data is no longer fragmented. Fragmented data can cause SQL Server to perform unnecessary data reads, slowing down SQL Server’s performance.
If you perform a reorganization on a table with a clustered index, any non-clustered indexes on that same table will automatically be rebuilt.

Database reorganizations can be done by scheduling SQLMAINT.EXE to run using the SQL Server Agent, or by running your own custom script via the SQL Server Agent (see below).


--Script to automatically reindex all tables in a database
USE DatabaseName --Enter the name of the database you want to reindex
DECLARE @TableName varchar(255)
DECLARE TableCursor CURSOR FOR
SELECT table_name FROM information_schema.tables
WHERE table_type = 'BASE TABLE'

OPEN TableCursor
FETCH NEXT FROM TableCursor INTO @TableName
WHILE @@FETCH_STATUS = 0

BEGIN
PRINT 'Reindexing ' + @TableName
DBCC DBREINDEX(@TableName, ' ', 90)
FETCH NEXT FROM TableCursor INTO @TableName
END

CLOSE TableCursor
DEALLOCATE TableCursor

The script will automatically reindex every index in every table of any database you select, and provide a fill factor of 90%. You can substitute any number you want for the 90 in the above script.

When DBCC DBREINDEX is used to rebuild indexes, keep in mind that as the indexes on a table are being rebuilt, the table becomes unavailable for use by your users. For example, when a non-clustered index is rebuilt, a shared table lock is put on the table, preventing all but SELECT operations from being performed on it. When a clustered index is rebuilt, an exclusive table lock is put on the table, preventing any table access by your users. Because of this, you should only run this command when users don’t need access to the tables being reorganized.

DBCC SHOWCONTIG

DBCC SHOWCONTIG (Table_id, IndexID)


Used to show how fragmented the data and indexes are in a specified table. If the data pages storing data or index information become fragmented, it takes more disk I/O to find and move the data to the SQL Server cache buffer, hurting performance. This command tells you how fragmented these data pages are. If you find that fragmentation is a problem, you can reindex the tables to eliminate it. Note: this is fragmentation of data pages within the SQL Server MDF file, not of the physical file itself.

Since this command requires you to know the ID of both the table and index being analyzed, you may want to run the following script so you don’t have to manually look up the table name ID number and the index ID number.

--Script to identify table fragmentation
--Declare variables
DECLARE
@ID int,
@IndexID int,
@IndexName varchar(128)

--Set the table and index to be examined
SELECT @IndexName = 'index_name' --enter name of index
SET @ID = OBJECT_ID('table_name') --enter name of table
--Get the Index Values
SELECT @IndexID = IndID
FROM sysindexes
WHERE id = @ID AND name = @IndexName

--Display the fragmentation
DBCC SHOWCONTIG (@ID, @IndexID)

While the DBCC SHOWCONTIG command provides several measurements, the key one is Scan Density. This figure should be as close to 100% as possible. If the scan density is less than 75%, then you may want to reindex the tables in your database.

DBCC SHOW_STATISTICS

DBCC SHOW_STATISTICS (table_name, index_name)

Used to find out the selectivity of an index. Generally speaking, the higher the selectivity of an index, the greater the likelihood it will be used by the query optimizer. You have to specify both the table name and the index name you want to find the statistics on.

DBCC UPDATEUSAGE

DBCC UPDATEUSAGE ('databasename')

The official use for this command is to report and correct inaccuracies in the sysindexes table, which may result in incorrect space usage reports. Apparently, it can also fix the problem of unreclaimed data pages in SQL Server. You may want to consider running this command periodically to clean up potential problems. This command can take some time to run, and you will want to run it during off-hours because it will negatively affect SQL Server’s performance while running. When you run this command, you must specify the name of the database that you want affected.

Removing duplicate records from a table where the rows are similar but still distinct.

We had an issue with a supplier table which contained records that were quite similar but still distinct.

For example:

Supplier ID   Supplier Number   Company ID   Name                Address
5             3                 COMP1        BLACK AND DECKER    UK, London
6             3                 COMP1        BLACK AND DECKER    United Kingdom, London
7             3                 COMP2        BLACK AND DECKER    UK, London
8             3                 COMP2        BLACK AND DECKER    United Kingdom, London

 

We want to remove records 6 and 8 but keep 5 and 7.

To do this, we wrote the following query:

DELETE from tblSuppliers
where exists (select 'x' from tblSuppliers X
where tblSuppliers.Number = X.Number and tblSuppliers.CompanyID = X.CompanyID
and tblSuppliers.SupplierID > X.SupplierID);

Done!
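
If you want to double-check which rows a DELETE like this will remove before running it, the same EXISTS predicate works as a plain SELECT (using the same tblSuppliers columns as above):

SELECT * from tblSuppliers
where exists (select 'x' from tblSuppliers X
where tblSuppliers.Number = X.Number and tblSuppliers.CompanyID = X.CompanyID
and tblSuppliers.SupplierID > X.SupplierID);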

 

Other Methods

Choose one of the following queries to identify or remove duplicate rows from a table leaving only unique records in the table:

Method 1:

SQL> DELETE FROM table_name A WHERE ROWID > (
  2    SELECT min(rowid) FROM table_name B
  3    WHERE A.key_values = B.key_values);

This deletes all rows whose ROWID is bigger than the smallest ROWID value (for a given key).

Method 2:

SQL> create table table_name2 as select distinct * from table_name1;
SQL> drop table table_name1;
SQL> rename table_name2 to table_name1;

This method is usually faster. However, remember to recreate all indexes, constraints, triggers, etc. on the table when done.

Method 3 (similar to the one we used):

SQL> delete from my_table t1
SQL> where  exists (select 'x' from my_table t2
SQL>                 where t2.key_value1 = t1.key_value1
SQL>                   and t2.key_value2 = t1.key_value2
SQL>                   and t2.rowid      > t1.rowid);

Note: One can avoid the N² unnecessary comparisons by creating an index on the joined fields used in the inner query (there is then no need to scan the entire table for each row). This will speed up the deletion process.
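
A minimal sketch of such a supporting index, using the my_table and key_value column names from Method 3 above (the index name is illustrative):

CREATE INDEX ix_my_table_keys ON my_table (key_value1, key_value2);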

Note 2: If the columns you are comparing can contain NULLs, wrap them in ISNULL (or the equivalent) when comparing, because NULL is never equal to NULL. This should not be a problem here, as all key columns should be NOT NULL by definition.