Improving the reliability of Saros using Root Cause Analysis
worked on by:
Sebastian Starroske
I continued the Root Cause Analysis after handing in my master's thesis. I will publish the results on this website on Sunday, February 10.
Current Version:
RCA
Outline
This thesis aims to uncover some important fundamental weaknesses of the Saros product that concern stability and the avoidance of inconsistencies. At the same time, it tries (as far as possible) to find out which changes to the Saros development process should help prevent similar weaknesses in the future, or help eliminate the current weaknesses thoroughly and permanently.
The approach is based on defect corrections: a few defects are selected one after another from the defect database and located in the code. They are then not merely fixed but also analyzed to determine which product properties presumably caused or contributed to their occurrence, and which process properties are either responsible for these product properties or prevented the defect from being avoided despite them.
In the course of this analysis, a chain of causes and effects is identified that describes quality-relevant relationships in the software process. The higher-ranking causes in this chain are also called root causes, and they are a valuable and proven aid for process improvement.
Building on this, the core of this thesis is to identify structural improvements in both the product and the process that eliminate as many of these root causes as possible, and to implement some of them in whole or in part.
Thesis Requirements
- Improve consistency of Saros
- Name and describe all current inconsistency types occurring in Saros
- Determine risk and probability of all types
- Understand the causes of the types which have the highest risk value
- Perform a Root Cause Analysis (RCA) for the types with the highest risk value
- analyze the results regarding coverage
- analyze tools and methods and create a short handbook / documentation on how RCA should be performed in the future of the Saros project
- Present / implement possible solutions for the found root causes
- nice-to-have: Analyze how the solved root causes reduce other non-inconsistency problems
Milestones and Planning
A milestone is a scheduled event signifying the completion of a major deliverable or a set of related deliverables.
A milestone has zero duration and no effort -- there is no work associated with a milestone. It is a flag in the workplan to signify some other work has completed.
Usually a milestone is used as a project checkpoint to validate how the project is progressing and revalidate work.
(Source: http://www.mariosalexandrou.com/definition/milestone.asp)
Milestone no. | Milestone | Goals | Past | CW | Accomplished
1 | Register Thesis | literature research; working on welcome checklist; getting to know Saros; planning the project | | CW27 | not accomplished - CW28
2 | Concept Presentation | further literature research; clustering of bugs/events; determining risk value for each bug/cluster (occurrence * consequences); outline of thesis; PSP | | CW31 | not accomplished - CW 34; started working on RCA I
3 | Presentation of a detailed schedule | time scheduling | | CW33 | accomplished
4 | RCA I | gathering data through fixing errors; try to identify first root causes | | CW39 | in progress - currently performing last steps
5 | RCA II | identification and fixing of root causes; statistical analysis on coverage | | CW45 | in progress
6 | Hand in thesis | finishing thesis; finishing presentation; finishing open tasks | | CW50 |
…
Weekly Status
Week 3 (CW 23)
Activities
Week 4 (CW 24)
Activities
- worked on clustering the bugs from the bug tracker
- JarSync
- Literature
Results
- possible way of clustering bugs / events / phenomena:
- Inconsistency and Invitation
- Inconsistency and Network / Protocol / internal Read-Only
- Inconsistency and file / directory or SVN operation (OS level)
- User Read-Only Mode
- GUI misbehaviour
Next Steps
- clustering
- evaluating risk and prioritizing
- steps for registering thesis
Problems
Week 5 (CW 25)
Activities
- work on clustering the bugs from the bug tracker
- JarSync
- Analyzing reproducibility of the bugs
- Evaluating risk (occurrence and impact)
Results
- updated clustering:
- Inconsistency and Invitation (10 entries)
- Basic Inconsistency - Recovery (12 entries)
- Basic Inconsistency - Partial Sharing (2 entries)
- Basic Inconsistency - Communication (6 entries)
- Follow Mode (12 entries)
- Inconsistency and file / directory or SVN operation (OS level) (8 entries)
- User Read-Only Mode (7 entries)
- GUI misbehaviour (6 entries)
- finished Risk Analysis and prioritization of clusters
- Inconsistency and Invitation
- Basic Inconsistency - Recovery
- Inconsistency and file / directory or SVN operation (OS level)
- User Read-Only Mode (7 entries)
- the other clusters don't have many serious open bugs (most of them are already closed)
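The milestone plan gives the risk value as occurrence * consequences. The following is a minimal sketch of how such a ranking could be computed. The class names, the 1 to 5 scoring scale, and every score in the example are assumptions made for illustration; they are not the values from the actual analysis.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** One bug cluster with hypothetical scores (assumed scale: 1 = low, 5 = high). */
final class BugCluster {
    final String name;
    final int occurrence;   // how often the failure shows up
    final int consequences; // how severe the failure is when it happens

    BugCluster(String name, int occurrence, int consequences) {
        this.name = name;
        this.occurrence = occurrence;
        this.consequences = consequences;
    }

    /** Risk value as stated in the milestone plan: occurrence * consequences. */
    int riskValue() {
        return occurrence * consequences;
    }
}

final class RiskRanking {
    /** Returns a new list sorted by descending risk value; ties keep their input order. */
    static List<BugCluster> prioritize(List<BugCluster> clusters) {
        List<BugCluster> sorted = new ArrayList<>(clusters);
        sorted.sort(Comparator.comparingInt(BugCluster::riskValue).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // Placeholder scores for illustration only, NOT results of the thesis.
        List<BugCluster> clusters = Arrays.asList(
            new BugCluster("Follow Mode", 2, 2),
            new BugCluster("Inconsistency and Invitation", 4, 5),
            new BugCluster("User Read-Only Mode", 3, 3));
        for (BugCluster c : prioritize(clusters)) {
            System.out.println(c.riskValue() + "  " + c.name);
        }
    }
}
```

Keeping the two factors separate, rather than storing only the product, makes it possible to re-score a cluster later when its reproducibility or impact is reassessed.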
Next Steps
- start with cluster: Inconsistency and Invitation
- comprehend the causes of already closed bugs and understand how they have been fixed
- from those bugs, go deeper to find more underlying causes / contributing factors or try to apply the knowledge gained to fix open bugs
Problems
Week 6 (CW 26)
Activities
- work on bugs 3458952, 3512804 and 3300579
Results
- bugs were understood and could be reproduced
- 3300579 could be reopened, because it still exists in the current version of Saros
Next Steps
- RM during the next week
- Try to reproduce and fix bug 3489409
Problems
Week 8 (CW 28)
Activities
Week 9 - 13 (CW 29 - CW 33)
Activities
- work on 3541540 Activity queuing is broken during synchronization
Results
- uploaded first patch in CW 33
- changed activity queuing: all activities are now queued in project specific Blocking Queues and then executed by Dispatcher Threads
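The queuing scheme described above can be sketched as follows: one FIFO blocking queue per project, each drained by its own dispatcher thread, so that activities of one project run in the order they were queued. This is only an illustration of the pattern, not the actual patch; `ProjectActivityDispatcher`, `enqueue`, the string project id, and the use of `Runnable` for an activity are all hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical dispatcher: one FIFO queue and one worker thread per project. */
final class ProjectActivityDispatcher {
    private final Map<String, BlockingQueue<Runnable>> queues = new ConcurrentHashMap<>();
    private final List<Thread> workers = new CopyOnWriteArrayList<>();

    /** Queues an activity; the worker for this project is created on first use. */
    void enqueue(String projectId, Runnable activity) {
        queues.computeIfAbsent(projectId, this::startWorker).add(activity);
    }

    private BlockingQueue<Runnable> startWorker(String projectId) {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    queue.take().run(); // blocks until an activity is available
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // shutdown was requested
            }
        }, "dispatcher-" + projectId);
        worker.setDaemon(true);
        workers.add(worker);
        worker.start();
        return queue;
    }

    /** Interrupts all dispatcher threads; activities still queued are dropped. */
    void shutdown() {
        for (Thread worker : workers) {
            worker.interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ProjectActivityDispatcher dispatcher = new ProjectActivityDispatcher();
        CountDownLatch done = new CountDownLatch(3);
        for (int i = 1; i <= 3; i++) {
            final int n = i;
            dispatcher.enqueue("project-a", () -> {
                System.out.println("executing activity " + n);
                done.countDown();
            });
        }
        done.await(); // wait until the worker has drained the queue
        dispatcher.shutdown();
    }
}
```

Because each queue has exactly one consumer, ordering within a project needs no further locking. A production version would also have to decide what happens when an activity throws, since an uncaught exception ends the worker thread in this sketch.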
Next Steps
- Preparing concept presentation
Problems
Week 14 (CW 34)
Activities
- work on 3541540 Activity queuing is broken during synchronization
- preparing concept presentation
Results
- presentation took place on August 23rd
Next Steps
- continue working on 3541540
Week 15 (CW 35)
Activities
- still working on 3541540 Activity queuing is broken during synchronization
Results
- stable version checked in on September 3rd
- the dispatcher Thread problem on Unix systems was also tested successfully with this patch
Next Steps
- analyzing how this patch fixes the sub entries (in SF)
- analyzing when the problem with the wrong Activity queuing first occurred
Week 16 (CW 36)
Activities
- analyzing how this patch fixes the sub entries (in SF)
- analyzing when the problem with the wrong Activity queuing first occurred
Results
- activity queuing was never flawless, but at the beginning of the project it was really hard to exploit this and cause failures
- 2 related bugs in SF are partly fixed with this patch, one cannot be tested, and one bug seems not to be related to #3541540
Problems
- a problem was detected in the patch for fixing the activity queuing: Activities sent before and partly while the project archive is created cause inconsistencies
- watchdog cannot detect those inconsistencies
Next Steps
- work on the problems described above
Week 17 - 18 (CW 37-38)
Activities
- working on the activity queuing
Results
- possible solution was found, but not implemented yet
Problems
Next Steps
Week 19 - 23 (CW 39-43)
Activities
- working on the activity queuing, specifically on the problem that not all Activities need to be sent to the invited person during the invitation process, since they might already be included in the archive
Results
- problem could be fixed
- the OutgoingInvitationProcess now has a Set containing all files that have already been packed into the archive
- this is used by the SarosSession to determine whether an Activity needs to be sent
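The bookkeeping described above might look roughly like the sketch below. `ArchivedFileTracker`, `markPacked`, and `needsToBeSent` are hypothetical names, not the Saros classes. The decision rule is also an assumption: a file that has not been packed yet will carry the change inside the archive, while a file that is already packed holds stale content there, so the Activity has to be sent separately.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the Set kept by the invitation process. */
final class ArchivedFileTracker {
    // Paths whose content has already been written into the invitation archive.
    private final Set<String> packedFiles = ConcurrentHashMap.newKeySet();

    /** Called by the archiving step once a file has been packed. */
    void markPacked(String path) {
        packedFiles.add(path);
    }

    /**
     * Assumed rule: an activity on an already packed file is missing from the
     * archive and must be sent; an activity on a file that is not packed yet
     * will be picked up when that file is packed.
     */
    boolean needsToBeSent(String path) {
        return packedFiles.contains(path);
    }

    public static void main(String[] args) {
        ArchivedFileTracker tracker = new ArchivedFileTracker();
        tracker.markPacked("src/Main.java");
        System.out.println(tracker.needsToBeSent("src/Main.java")); // already packed
        System.out.println(tracker.needsToBeSent("src/Util.java")); // not packed yet
    }
}
```

The rule only holds if the archive is built from the same state the editor sees, and the sketch does not address an activity arriving while its file is being packed.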
Problems
- Eclipse and Filesystem were out of sync --> changes caused by Activities were not included in the archive
Next Steps
- check if ReadOnly can be disabled during invitation
- make a detailed description of the patches and give hints for reviewers
Week 24 (CW 44)
Activities
- working on the description of the patches
- collecting information about the solution process
- analyzing ReadOnly
- working on an outline
Results
- review description was sent to DPP-DEVELOP
- created outline and collected information for some of the chapters
- implemented a stress test, where multiple files are edited during an invitation process
Problems
- Inconsistencies can occur when a non-host invites
Next Steps
- forbid non-host invitation
- working on the introduction