SQL join help needed

joblo37pam · Jun 21, 2009

I'm self-taught in SQL, so please forgive the crudeness of what I'm trying to do.

I manage an asp.net app with a MS SQL 2005 database that is used to enter event registrations for an organization I belong to. Each entry becomes a new row in the database, and that all works great. The only problem I'm having is building a report that does what i want it to. Some of the events require a partner, so when an entry for those events is made, the contestant is asked to enter their partner's name so that they can be matched together later. This is where the problem comes in.

Since the contestant's own name is separated into first and last name fields (ContFirstName, ContLastName), but they enter their partners name in a single field (PartnerName), it is difficult to programatically match the partners entries together. I have come up with a query that will basically do what I want as long as the contestant spells their own name the same way that their partner does, but am having a hard time trying to get the mispelled ones matched. Here is what I have:

Code:

SELECT parent.PartnerName, child.FirstName + child.Lastname FROM Entries as parent LEFT JOIN Entries as child ON (rtrim(parent.partnername)) = (rtrim(child.FirstName) + ' ' + rtrim(child.LastName))

I understood LEFT JOIN would return all of the columns in the left table, even if there isn't a match in the right table. That isn't happening for some reason. Currently, there are 38 entries in the table for a partner event, but this query only returns the 32 that are spelled correctly and ignores the rest. Even if I could get the query to return a null or empty string if it doesn't find a match, I could work with that in the asp code. I just can't seem to figure out why the left join isn't returning all of the rows.

Any help would be appreciated. If you need more information, just ask. Like I said, I'm a noob at this, so I'm not sure what I'm missing. Thanks.

calebb · Jun 22, 2009

Are you aware that rtrim and ltrim only removes spaces? Do you have a lot of stray spaces in your data? Are you intending to use RIGHT()? http://msdn.microsoft.com/en-us/library/ms177532.aspx

To help you understand what the LEFT OUTER JOIN is doing, run these queries: (the same as your queries, but with different SELECTs)

Code:

SELECT 
	--parent.PartnerName, 
	--child.FirstName + child.Lastname 
	parent.*
FROM Entries as parent 
LEFT OUTER JOIN Entries as child ON (rtrim(parent.partnername)) = (rtrim(child.FirstName) + ' ' + rtrim(child.LastName))


SELECT 
	--parent.PartnerName, 
	--child.FirstName + child.Lastname 
	child.*
FROM Entries as parent 
LEFT OUTER JOIN Entries as child ON (rtrim(parent.partnername)) = (rtrim(child.FirstName) + ' ' + rtrim(child.LastName))

Now, I'm still not exactly sure what your data looks like... and I'm not sure what your performance requirement are as well as capabilities of your SQL server. However, I'll assume you're not running a super high volume server and you're not pushing the envelope with bare-bones specs for processor/memory.

One possibility is to enable full-text indexing and use CONTAINS:
http://msdn.microsoft.com/en-us/library/ms187787(SQL.90).aspx

Where are the mispellings typically? After the first few letters of the child's name? You could add yourself a simply split UDF, split the child's name on the space and then directly compare the first 3 letters of the firstname and the first 3 letters of the lastname.
http://geekswithblogs.net/AngelEyes/archive/2007/04/12/111504.aspx

CHARINDEX in conjunction with LEFT() and RIGHT() might be another possibility: http://msdn.microsoft.com/en-us/library/ms186323(SQL.90).aspx

this would result in a query like this. (I use a derived table so I don't have to have the messy CHARINDEX() stuff listed again in the JOIN

Code:

SELECT
	*
FROM
	ENTRIES E1
INNER JOIN
	(SELECT
		LEFT(recipFullNAme, CHARINDEX(' ',PartnerName)-1) AS FirstNameP,
		RIGHT(recipFullName, CHARINDEX(' ',PartnerName)+1) AS LastNameP,
		P.*
	FROM 
		Entries P) AS ExpandedEntries
	ON ExpandedEntries.FirstNameP = E1.FirstName AND ExpandedEntries.LastNameP = E1.LastName

And then an improved version of this might be:

Code:

SELECT
	*
FROM
	ENTRIES E1
INNER JOIN
	(SELECT
		LEFT(recipFullNAme, CHARINDEX(' ',PartnerName)-1) AS FirstNameP,
		RIGHT(recipFullName, CHARINDEX(' ',PartnerName)+1) AS LastNameP,
		P.*
	FROM 
		Entries P) AS ExpandedEntries
	ON LEFT(ExpandedEntries.FirstNameP,3) = LEFT(E1.FirstName,3) AND LEFT(ExpandedEntries.LastNameP,3) = LEFT(E1.LastName,3)

...so that only the first 3 digits are compared.

Now, if you have problems like some people add a middle initial for their child... you'll need a more complicated query (or some .NET magic). (i.e., count how many "spaces" there are in the PartnerName - and then use a CASE or IF BEGIN to do a different query in these cases...

Good luck!

PTNL · Jun 22, 2009

Maybe try adjusting your query as such:

Code:

SELECT p.PartnerName, (c.FirstName + ' ' + c.Lastname) AS CompetitorName
FROM Entries p
RIGHT OUTER JOIN Entries as c ON ((rtrim(p.PartnerName)) = (rtrim(c.FirstName) + ' ' + rtrim(child.LastName)))

While we don't have enough info to fully evaluate the app/needs, I'm really left thinking that alot (though not all) of your data matching issues could be eliminated by adjusting the user interface. Why not have the users enter the participant's first and last name in separate fields, instead of meshing into one? That change alone could be really useful in making data fixes to ensure proper matches are made; or at least make it easier to write a script to see what discrepancies need human inspection.

While the RTRIM and LTRIM SQL functions are used for good measure, I hope that you are trimming all string data in the ASP.Net pages or business layer prior to it being sent to your INSERT or UPDATE statements.

Edit: Just saw a reply to this thread after I posted this message...

joblo37pam · Jun 22, 2009

I am aware that the ltrim and rtrim functions only trim spaces. Some of the contesant data is imported into the table via an excel spreadsheet from a secretary, and isn't always clean. In order to keep strings clean without running a query against the table every time data is imported, I try to make sure that I trim all of the strings whenever I call them just to be on the safe side.

As far as performance is concerned, this is really a pretty light app. Resources aren't great, as the SQL database is on a shared hosted server, but it performs as well as we need it to for now.

I think using full-text and 'contains' would definitely help, but still wouldn't grab everything, so it doesn't really solve the problem. I'm honestly not expecting a query to match everything for me, but I need to know when a match isn't made so that I can fix it.

Here is an example of the table (simplified) to show you what I am dealing with:

Code:

EntryID     ContFirstName     ContLastName    City     Event1     Event2     Event3    PartnerName
---------------------------------------------------------------------------------------------------------------------------------------------
1         John               Doe               Here    FALSE    FALSE    TRUE     Jane Doe
2         Jane               Doe               There   TRUE      FALSE    TRUE     John Doe

So every team will have two rows in the table, one for each partner. I am trying to use this query to return both partners (along with their citys) in one row to be printed in a program.

The spelling discrepancies can be just about anything (Luke instead of Lucas, transposed letters, nicknames, etc.) I really don't expect the query to be able to match those types of things, but I do need it to return a null or something when it can't match them. Then I can go back and manually correct them.

Maybe I'm thinking about this backwards. Maybe I need to set up a page to display the names (sorted alphabetically) first so I can fix spelling discrepancies before I run this query.

calebb · Jun 22, 2009

joblo37pam said:
I am aware that the ltrim and rtrim functions only trim spaces. Some of the contesant data is imported into the table via an excel spreadsheet from a secretary, and isn't always clean. In order to keep strings clean without running a query against the table every time data is imported, I try to make sure that I trim all of the strings whenever I call them just to be on the safe side.

As far as performance is concerned, this is really a pretty light app. Resources aren't great, as the SQL database is on a shared hosted server, but it performs as well as we need it to for now.

I think using full-text and 'contains' would definitely help, but still wouldn't grab everything, so it doesn't really solve the problem. I'm honestly not expecting a query to match everything for me, but I need to know when a match isn't made so that I can fix it.

Here is an example of the table (simplified) to show you what I am dealing with:

Code:

EntryID ContFirstName ContLastName City Event1 Event2 Event3 PartnerName --------------------------------------------------------------------------------------------------------------------------------------------- 1 John Doe Here FALSE FALSE TRUE Jane Doe 2 Jane Doe There TRUE FALSE TRUE John Doe

So every team will have two rows in the table, one for each partner. I am trying to use this query to return both partners (along with their citys) in one row to be printed in a program.

The spelling discrepancies can be just about anything (Luke instead of Lucas, transposed letters, nicknames, etc.) I really don't expect the query to be able to match those types of things, but I do need it to return a null or something when it can't match them. Then I can go back and manually correct them.

Maybe I'm thinking about this backwards. Maybe I need to set up a page to display the names (sorted alphabetically) first so I can fix spelling discrepancies before I run this query.

CONTAINS does similarity matches. You choose how 'fuzzy' the match is. It sounds like you would just choose a sufficiently fuzzy match that Lucas matches Luke.

Or... in the case of Like vs Lucas, the last query I posted before would work, but use LEFT(...,2) instead of 3 in the join.

(The downside is would match anyone where the first two letters of their first and last name match... Luke Anderson with Lucas Andes... but since you said this is a small system, I would bet you will never have names that are this similar. Or even better, alter it such that the first 5 digits of the last name have to match but only the first 2 letters of the first name have to match. i.e., LEFT(...,2) and LEFT(...,5))

Also, you should use LTRIM(RTRIM( if you want to get rid of stray spaces on both sides of the data.... or run a scheduled query to clean up the data. (UPDATE NAMES SET FirstName = LTRIM(RTRIM(FirstName)))... or you could fix the Secretary's Excel spreadsheet so she can't enter leading/trailing spaces. (the main reason for this, IMO, is it makes your queries less maintainable when everything is wrapped in LTRIM(RTRIM()))

joblo37pam · Jun 22, 2009

How much more overhead does CONTAINS take?

I didn't even think about scheduling a 'clean up' query. That could come in handy and could make all of my code a lot cleaner.

The spreadsheet actually doesn't have any trailing spaces, but when you import it into access as a frontend for the database, the spaces start showing up. It's generally not a problem, except in cases like this. The scheduled clean up may be the easiest way to fix it.

Pwyl_The_Destroyer · Jun 22, 2009

PTNL said:
, I'm really left thinking that alot (though not all) of your data matching issues could be eliminated by adjusting the user interface. Why not have the users enter the participant's first and last name in separate fields, instead of meshing into one?

I haven't seen the OP respond to this comment. It seems that cleaning up the data model would be better than writing fuzzy matching queries. I would further think that if all partners must be defined in the table, that you would keep the EntryId of the partner in the Partner field, and refactor the UI to allow for choosing an existing entry as a partner.

joblo37pam · Jun 22, 2009

Pwyl_The_Destroyer said:
I haven't seen the OP respond to this comment. It seems that cleaning up the data model would be better than writing fuzzy matching queries. I would further think that if all partners must be defined in the table, that you would keep the EntryId of the partner in the Partner field, and refactor the UI to allow for choosing an existing entry as a partner.

Separating the fields isn't going to fix spelling discrepancies, which are really the main issue. It really won't do anything that the first query I posted won't. All that it would accomplish is frustrating users who are used to being able to type a name into a single field. Those users also won't know the EntryID for their partner. Letting them choose an existing entry is a good idea, but will only work for the second partner, obviously.

All I really want to do with my original query is have the results show up like this:

Code:

Contestant1 - City   |    Partner1 - City
Contestant2 - City   |    Partner2 - City
Contestant3 - City   |    null
Contestant4 - City   |    Partner4 - City

Where Contestant3's partner isn't found. The way it work right now, Contestants 1,2, & 4 are returned with their partners correctly, but Contestant3 is completely missed. I want Contestant3 to be returned, even if the partner is not found.

PTNL · Jun 22, 2009

joblo37pam said:
Separating the fields isn't going to fix spelling discrepancies, which are really the main issue. It really won't do anything that the first query I posted won't. All that it would accomplish is frustrating users who are used to being able to type a name into a single field. Those users also won't know the EntryID for their partner. Letting them choose an existing entry is a good idea, but will only work for the second partner, obviously.

I agree that having the user select a person's name from a list is not a good idea. Plus, it opens up some potential privacy issues. It would be different if you were talking about a team signup, where a user registers and selects which team to signup for.

However, what level of backlash would you get for having the user split out someone's first and last name? Every secure and reputable site imaginable has separate fields for first and last name. This approach is not the least bit uncommon.

It just seems that some of the data schema decisions were based solely on the UI perspective (presentation), instead of cleaner data hierarchy and categorization.

joblo37pam said:
All I really want to do with my original query is have the results show up like this:

Code:

Contestant1 - City | Partner1 - City Contestant2 - City | Partner2 - City Contestant3 - City | null Contestant4 - City | Partner4 - City

This is kinda what I was imagining from the first post. It did not factor the city, but another INNER or OUTER JOIN (depending on your schema) should get the city in there as well. Though given the uncertainty of successfully matching on an entered participant's name, I think an OUTER JOIN would likely be better.

joblo37pam · Jun 22, 2009

I already have it working with the city information, it's just a matter of returning all of the information, whether it has a match or not.

As for the separation of first and last names, I understand what you're saying and agree with you. People just don't like change. They would get over it, but I honestly don't see where we would gain anything with the problem at hand.

maxedoutcc · Jun 22, 2009

can you post your latest working script?

joblo37pam · Jun 22, 2009

Wouldn't you know it, I got it working. I don't even know what I changed. I just started from scratch and typed it all out. It is returning nulls for those that don't match, just like I want. here is the code (I am posting the full-unsimplified version this time):

Code:

SELECT Parent.FirstName, Parent.LastName, Parent.City, Child.FirstName, Child.LastName, Child.City FROM Entries as Parent Left Outer Join Entries As Child On rtrim(Parent.PartnerName)=(Rtrim(Child.FirstName) + ' ' + Rtrim(Child.LastName)) WHERE Parent.Event3='TRUE'

Now I can at least see which ones aren't matching and go edit them accordingly. Then the program will be correct without any manual editing.

Thanks for all of the help and tips, guys. I used to hate writing code, but when things work like you want to, it's actually kind of fun.

SQL join help needed

joblo37pam

2[H]4U

calebb

[H]F Junkie

PTNL

Supreme [H]ardness

joblo37pam

2[H]4U

calebb

[H]F Junkie

joblo37pam

2[H]4U

Pwyl_The_Destroyer

Limp Gawd

joblo37pam

2[H]4U

PTNL

Supreme [H]ardness

joblo37pam

2[H]4U

maxedoutcc

Gawd

joblo37pam

2[H]4U