Friday, March 27, 2009

My columns are goofed!!!

I have a table in production and dev environments that is not getting replicated data via GoldenGate replication properly. I have spent the last 2 weeks on and off the phone with support. We have tried this way and that way to get the data to pump from source to destination. All without fail.

2 days ago, the support tech suggested that we create a new table, and map to it on the destination, and see if we can pump the data from source to destination that way. Something had to be wrong with the original destination table. I thought this was an odd test, but in fairness, figured anything is worth trying. I will often do silly things to test that i may not think would or should work, if for no other reason than to eliminate them from the possible choices.

So, I create another table, and setup replication to map to this new table. I update a record, and viola! it worked... Wild. Ok, so what's the difference between these tables? Where are the discrepancies? I start digging in to see whats different, if anything, and find nothing. Other than the table name, i see no obvious differences. There are some constraint names different between them, but thats a SQL naming thing, and can usually be ignored. So, I drop the table, and recreate it with another script. This one from Enterprise Manager (previously created the table with a script from SSMS). Once the table was created, I altered data on the source, expecting to see it flow to the destination, and no. Was denied. Now it wasn't working.

So, I thought I had narrowed it down to something. Different tools had created 2 distinct scripts, something had to be different between these two scripts for this new table. So I dropped the table, and recreated it with the original SSMS script, and repeated my steps to update data. Nothing flowed. Now it wasn't working again.
Much to my dissapointment, and the support techs, we didnt know what had caused the 1 instance of success.

Back to the drawing board, which had now become a brick wall, upon which i had bloodied my forehead through repeated banging. None of which seemed to help, but did help me forget my problems for a moment as I furtively impacted my head to wall.

Later that day, the support tech suggested i execute a query to pull some info out about the Identity field. Guess I should explain about the oddness of this table first...

The table has a primary key of UserName, which is a varchar. It also has an ID field in the field list that is an Identity int field. Why this field is not the PK, i do not know. It is the way it is. This seemed to be causing our problems, having an identity field, not the PK, and another the PK. So, when we mapped this with replication, we needed to use KeyCols and include both these fields, to ensure we had unique constraints for GoldenGate replication to properly apply the data.

Ok, back to the query the support tech requested.


select
col.colid,
col.name,
ColumnProperty(col.id, col.name, 'IsIdentity')
from syscolumns col
where col.id = Object_Id('dbo.users')


This was going to prove to them and me, what fields were Identity fields once and for all. This has caused quite a bit of confusion during the process, as we were constantly confused at the UserName being the PK, and having an ID as an Identity.
So, I run the query on dev environment, and give them the results from both the source and destination table. I did this quickly, as I didnt see the need for it, not the interest. I knew and had repeated what the PK was, and what the Identity was, well, repeatedly. Why the support tech was so worried about this, I do not know. But I am happy he was covering all his bases. For out of this query result, we found an interesting thing.

Look back at the query, you'll notice that we return the colid field with each field. Nothing special about this in the query, in fact, it really didnt even need to be there for the purpose of finding the Identity field. But the numbering, the ordering of the results were key. Colid 1 is UserName. This we expected. Its the first column. The second column in the table is FirstName. Its colid should be [2]. It is not!!! It is [3]. From there on out, the number is consistent.

So, in our table of 16 fields, we have colid 1,3-17, missing colid 2 entirely. A quick check show that this is true in dev as well as in production. A quick check of a new table I create shows the colids in order 1-16. I attempt a new test, and sure enough, the data moves to this new table, with the correct colid numbering sequence.

It would appear that GoldenGate replication needs this colid to be in sequence properly to function properly. This one table seems to have been goofed at sometime, colids all out of whack, and this could be the cause for why replication is having a hard time moving data to it.... Yeah!

The story ends here, as this is as far as I have gotten. I actually stopped working on it to document it as it was fresh in my mind. As I find solutions to this issue, I will update this or post another entry to cover it.

Yeah, though. Up until now, we had no idea what was causing our issue. Tons of attempted tests and supposed resolutions have all proved fruitless. Along the way, I have learned a lot of valuable information about the replication system with GoldenGate. I have been able to gain experience with a lot of different aspects of it, so the entire process has been positive, even though frustrating.

Now to figure out how this happened to this table... search for a solution to it, and even investigate other tables, see if they have missing number sequences... maybe other tables are having issues that we do not know about.

No comments: