SnTT, kinda: Interesting replication problem, and I wanted to get your ideas
Category: Lotus, Show-n-Tell Thursday, Technical
This is a technical post of sorts, and even though I'm not "showing" a solution or trick, I am "showing" a problem I encountered with the intention to use the "grid computing" power of our collective knowledge and experiences. Also keep in mind that I want to keep this as generic as possible, so that I can get an "untainted" set of opinions. Here's the situation.
John Coolidge (of the MS Constants database fame) and I were at a customer site, and there was a problem with replication of design elements. After some design elements were copied into the database programmatically (via published APIs, nothing tricky or experimental - the UNID was preserved via this process, also a part of the API), they wouldn't replicate at all, anywhere, until you edited the elements manually in Domino Designer. Also there is no replication error thrown at all.
Incidentally, this technique has worked at many, many, many customer sites, so there's nothing wrong with the process we're using; this is the first customer for which the following problem has occurred.
All servers are in the same location, same time zone.
OK, back to the problem at hand. In order to rectify the problem we tried:
- Clearing replication history of both the source and target replicas
- Testing this technique on different databases
- Compacting, Fixing, Updating, etc. - all the stuff you'd normally do to fix a corrupted db
- And probably a few other things that I'm just not remembering - but I'm sure I'll remember when you ask about it
None of this worked - the problem persisted. However, I had a sneaking suspicion about what the problem might be, and after some investigation we had an idea as to why this was happening.
Now for those of you not familiar with replication, I think it is important for you to understand exactly how it works before moving on; therefore let me give you a quick primer on it, so that you have the same base knowledge from which I am working.
Replication Background
NOTE: You can find the gorpy details of all of this stuff I am talking about here, but I will pick out the salient points for this discussion. First, let me explain about the Originator ID (OID).
Originator ID (OID)
The OID is comprised of:
- Universal ID (UNID) which (as you know) is the unique identifier for a note (remember, a "note" can be either a document [data] or design element [design]) in a database
- the sequence date (SD), which is a hex version of a date/time stamp
- the sequence number (SN), which is a hex number representing the number of times the note has been saved
The SD and the SN are updated every time the note is "dirtied" (i.e. a change is made to the note) and saved. The UNID never changes after a note is created and saved.
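To make the three parts concrete, here is a minimal sketch of an OID as a small Python record. The class name, the field names, and the `save_note` helper are all hypothetical and chosen for illustration; this is not the layout Notes uses internally, just a model of the rule that a save bumps the SD and SN while the UNID stays put.

```python
from dataclasses import dataclass, replace
from datetime import datetime


@dataclass(frozen=True)
class OID:
    """Illustrative model only; field names are assumptions, not the Notes API."""
    unid: str                # fixed once the note is created and saved
    sequence_date: datetime  # SD: when the note was last saved
    sequence_number: int     # SN: how many times the note has been saved


def save_note(oid: OID, now: datetime) -> OID:
    """Return the OID a dirtied note would carry after being saved at `now`."""
    return replace(
        oid,
        sequence_date=now,
        sequence_number=oid.sequence_number + 1,
    )


# A note saved for the first time starts at SN 1 (placeholder UNID).
created = OID("UNID-PLACEHOLDER", datetime(2008, 6, 13, 9, 0), 1)
edited = save_note(created, datetime(2008, 6, 13, 10, 30))
```

After the edit, `edited` has the same `unid` as `created`, a later `sequence_date`, and a `sequence_number` of 2.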
Take a look at the graphic below:
This happens to be the properties box for this document. The first part that is highlighted, in blue, is the Sequence Date (SD); the second part that is highlighted, in red, is the Sequence Number (SN). So, when I took that screen capture I had saved this document once (the SN always begins at 1).
The replicator uses the OID to determine if two replicas of the same note are the same, or if one (or both) of them has been changed. Let's talk about that a bit more.
Replication - How it Works
Here's an explanation from the aforementioned document from Lotus on how the replicator uses the OID for replicating notes, since it explains it better than I probably would.
NOTE: In the explanation below the term "Sequence Time" is used instead of "Sequence Date". I was always taught Sequence Date (because of the SD abbreviation), so that's what I use.
The UNID, the OID, and the Replicator
(Emphasis mine)
The Universal Note ID (the first half of the Originator ID) uniquely identifies all versions and all copies of the same note. Two notes are replica copies of each other if they share the same UNID. Therefore, different versions and all replica copies of the same note have the same UNID. A corollary of this rule is that one database must not contain two notes with the same UNID. If the replicator finds two notes with the same UNID in the same database, it generates an error message in the log and does not replicate the document.
The full Originator ID, on the other hand, uniquely identifies one particular version of a note. In other words, all replica copies of the same version of a note have the same OID. However, a modified version of a replica copy of a particular note will have a different OID, because Domino and Notes increment the sequence number when a note is edited and also sets the sequence time to the timedate when the sequence number was incremented. Therefore, when one replica copy of a note remains unchanged but another copy is edited and modified, the UNIDs of the two notes remain the same but the sequence number and sequence times (and therefore the OIDs) are different.
The Domino replicator uses the UNID to match the notes in one database with their respective replica copies in other databases. For example, if database A is replicating with database B, and database A contains a note with a particular UNID but database B does not, the replicator creates a copy of that note and add it to database B.
If database A contains a note with a particular UNID and database B contains a note with the same UNID, the replicator concludes that these two notes are replica copies of one another. In this case, the replicator goes on to examine the sequence number and sequence time of the two notes. If the sequence number and sequence time are the same for both notes, then the replicator concludes that the two notes are up to date with one another, and no action is required. On the other hand, if either the sequence number or the sequence time -- or both -- differ between two notes, the replicator must decide which one is more recent and update the older note with the most recent version.
If one note has been updated but the other note has not, the sequence number of the first note will be greater than that of the second. The replicator handles this case by overwriting the second note with the first, bringing the two databases into synchronization.
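The rules quoted above can be sketched in a few lines of Python. This is only my reading of the quoted text, not the replicator's real code: the function name, the dictionary layout (UNID mapped to a sequence number and sequence time), and the action strings are all made up for illustration. It only walks the notes in database A, and any case the excerpt does not spell out is reported as undecided rather than guessed at.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# (sequence number, sequence time) for one note; a hypothetical shape.
Version = Tuple[int, datetime]


def plan_replication(a: Dict[str, Version], b: Dict[str, Version]) -> List[str]:
    """List what replicating A's notes against B would do, per the quoted rules."""
    actions = []
    for unid, (sn_a, st_a) in a.items():
        if unid not in b:
            # B has no note with this UNID, so it gets a copy.
            actions.append(f"copy {unid} to B")
            continue
        sn_b, st_b = b[unid]
        if (sn_a, st_a) == (sn_b, st_b):
            # Same sequence number and time: already up to date.
            actions.append(f"skip {unid}")
        elif sn_a > sn_b:
            actions.append(f"overwrite {unid} in B")
        elif sn_b > sn_a:
            actions.append(f"overwrite {unid} in A")
        else:
            # Same SN but different times: not covered by the quoted excerpt.
            actions.append(f"undecided {unid}")
    return actions


t1 = datetime(2008, 6, 13, 9, 0)
t2 = datetime(2008, 6, 15, 9, 0)
db_a = {"N1": (1, t1), "N2": (2, t1), "N3": (5, t2)}
db_b = {"N2": (2, t1), "N3": (3, t1)}
plan = plan_replication(db_a, db_b)
```

Here `plan` comes out as a copy for N1, a skip for N2, and an overwrite of N3 in B.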
Now, we could go on to explain field-level replication, replication conflicts, and so on, but we don't need to. Why not? Because we're talking about design elements, and with design elements the newest one wins.
Or so I thought.
My Problem - the Nitty Gritty
Based upon the replication explanation and how the OID is used, we can make the following simplistic assumptions about the various conditions for replication. In my examples below I am going to use regular date/time and decimal expressions to represent the SN and SD. Assume that I'm talking about replicas of the same design note.
| Design Note 1 (SD : SN) | Design Note 2 (SD : SN) | Results and Comments |
|---|---|---|
| 13 June 2008 : 03 | 13 June 2008 : 05 | SDs are the same, note 2's SN is greater, so note 2 wins |
| 15 June 2008 : 12 | 13 June 2008 : 05 | Note 1's SD is greater, so note 1 wins |

So far, so good. But when investigating this problem we had, here's the situation we found...

| Design Note 1 (SD : SN) | Design Note 2 (SD : SN) | Results and Comments |
|---|---|---|
| 13 June 2008 : 12 | 25 June 2008 : 04 | Notice that note 1's SD is lower than note 2's, but the SN is higher |
As I stated before, the third example is what we found in our investigation. If everything is working correctly, this shouldn't be possible: one replica of a note having a lower SD but a higher SN than the other. I am not sure how this happened, and the customer had no idea either. We asked whether there might have been a server time issue at some point, but the customer had not noted any such problem.
Now, we're not sure that this is the cause of the problem - but it is our best guess. The only part of this explanation that bothers me is that with design element notes I always thought "last one in wins", period - so if that is the case, shouldn't note 2 win in the example listed above? My guess is that the replicator still evaluates the entire thing - SD and SN - even with design elements; it just doesn't create a replication/save conflict, it simply picks the "winner" and ensures that it is in both replicas. But in this case, it couldn't make a determination as to which note should win, so the replicator did the equivalent of "huh??", threw its hands in the air, and ignored the problem.
However, if this is the case, shouldn't a replication error be thrown?
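Here is my guess written out as a small Python sketch, run against the three rows from the table above. To be clear, this models my hypothesis, not anything I know about the replicator's internals: the `pick_winner` function and its "give up when SD and SN disagree" rule are assumptions made up for illustration.

```python
from datetime import date


def _compare(x, y) -> int:
    """Return 1 if x > y, -1 if x < y, 0 if equal."""
    return (x > y) - (x < y)


def pick_winner(note1, note2):
    """Hypothetical rule: return 1 or 2 for the winning note, 0 if identical,
    or None when the SD and the SN point in opposite directions."""
    (sd1, sn1), (sd2, sn2) = note1, note2
    by_date = _compare(sd1, sd2)
    by_number = _compare(sn1, sn2)
    if by_date * by_number < 0:
        return None  # SD says one note is newer, SN says the other
    signal = by_date or by_number  # whichever signal is non-zero, if any
    if signal == 0:
        return 0
    return 1 if signal > 0 else 2


row1 = pick_winner((date(2008, 6, 13), 3), (date(2008, 6, 13), 5))
row2 = pick_winner((date(2008, 6, 15), 12), (date(2008, 6, 13), 5))
row3 = pick_winner((date(2008, 6, 13), 12), (date(2008, 6, 25), 4))
```

Under this rule the first two rows resolve to note 2 and note 1 respectively, while the third row returns `None`, which is the "huh??" case where neither note can be declared the winner.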
Conclusion
The way to resolve this problem is to create new replicas of the databases from the most recent replica, in order to get the SD and SN numbers in sync again. This fixes the problem, and life is good again.
Well, there you have it. I would love to hear your theories, answer your questions, and generally discuss this weird anomaly to see if we can figure out why this happened. In any case maybe this post will help others in the future if they encounter the same problem.
So share your ideas, and let's discuss.
Rock
