A rebuild can cure a startup's scaling problems and flood of bugs. Here is how to make sure it becomes medicine rather than a band-aid.
At the start of a project, whether you are a startup or an established corporation, we tend to build in technical debt. It can look like: no tests, badly named variables, code that is hard to scale or complicated to modify, wrong shortcuts, or anything else that has you taking aspirin every day, all to keep up with customer growth or ship a minimum viable product that the market accepts. Time passes, and now management wants new features or your customer base grows exponentially. But your system is hard to adapt: your team is either busy fixing never-ending bugs or struggling to add anything new.
Suddenly your whole team (and management) agrees on a refactoring plan to overcome those obstacles. But refactoring is a tricky challenge: progress is harder to see than in feature development (some tickets seem not to move from day to day), and it has drawbacks such as:
- you must allocate dedicated team members and time,
- you must regression-test the whole system,
- you must put some new features on hold,
- you must make sure existing features stay stable and unbroken,
- there is often no documentation of how the existing modules work.
At my current company, our rewrite did not just make the project more robust and clean; we also changed the language and framework. Here is how we rolled it out without disturbing customers. Let's dive in.
We started at the beginning of 2018.
We moved from PHP to Ruby on Rails because our CTO loves Ruby. Kidding 🙂
We switched to Ruby for several reasons:
- No tests in PHPUnit. When I joined, I tried to write unit tests in PHP, but it took a long time to separate the logic from the query builder in the existing classes, and I spent more time preparing input data than writing the logic tests.
- No standardisation and no code review. Yes, this is not the language's sin, but it left us with duplicated (not DRY), hard-to-read code.
- Hard to maintain. Changing or adding a feature tended to break other features. Every time we deployed a patch, we prayed to God that the system would still be fine.
- On-the-fly calculation. In some features, option A read the value directly from the database, while option B recalculated it with a formula. This made data inconsistent in some areas and upset customers.
- Some UI/UX needed to adapt.
Disclaimer: I am not saying Rails is better than Laravel. We switched languages because in past development, developers tended to take any shortcut available, and we did not want a premature refactor in the same language. We also needed a separate repository so we could enforce a strict pipeline, such as a linter and a full test run, before code is merged to the master branch.
Our rewrite strategy:
- We replace modules one by one rather than doing a big-bang deploy. A big-bang deploy can bring catastrophe, as a previous module did: it takes a long time to build, anywhere from half a year to several years, with an unchanged scope, while customer demand can change within a few months. A big scope also gets underestimated at estimation time, and edge-case scenarios get forgotten.
- Be data-driven when choosing which old module to move to the new system. Look at how many transactions each module generates: more transactions in a module means it is more important to build. Use the Net Promoter Score to spot which parts of a module need improvement.
- We hold group discussions on the new database schema to analyse the challenges and capabilities of the proposed design. We even invite the product owner to write scenarios and simulate how data is inserted and read.
- Use pair programming to set up core code such as model definitions and basic services. We also pair on migration scripts, because they contain so many if-else conditions caused by abnormal data and behaviour differences between the old and new systems.
- Keep test coverage above 90%.
- Use code review and a linter to standardise code. Both are enforced in the repository pipeline, so every change is tested and standardised.
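The data-driven ordering above can be sketched in a few lines of plain Ruby. This is only an illustration, not our actual tooling: the module names, the numbers, and the `ModuleStats` and `migration_order` names are all hypothetical, and the tie-break on Net Promoter Score is an assumption about how the two signals might be combined.

```ruby
# Hypothetical per-module usage numbers. Real values would come from the
# old system's transaction tables and from customer surveys.
ModuleStats = Struct.new(:name, :transactions, :nps)

# Returns module names in suggested migration order.
# Most-used modules come first; among equally used modules, the one with
# the lower Net Promoter Score (more customer pain) comes first.
def migration_order(stats)
  stats.sort_by { |s| [-s.transactions, s.nps] }.map(&:name)
end

stats = [
  ModuleStats.new("payroll", 12_000, 10),
  ModuleStats.new("attendance", 48_000, 35),
  ModuleStats.new("leave", 12_000, -5)
]

puts migration_order(stats).inspect
```

In practice the ranking is only a starting point for the discussion with the product owner, since dependencies between modules can force a different order.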
Our first deploy was the login and signup module. We used the existing tables to avoid a migration and to serve as a proof of concept that the code could run in production. We also shared the same session and OAuth so that Laravel could use the login module directly.
And voilà, it worked like a charm.
The next step was refactoring the core data, such as decoupling employee information, roles and additional attributes. We did not change columns on the existing tables; instead we leveraged other tables. In this step we created new tables such as profile, to store an employee's personal identity, dependent, to store family members, and other tables that supplement the employee record.
Before deploying to production we hit some chaotic moments caused by things we had not anticipated, like:
- Both systems needed redirect URLs to each other.
- Some data needed to be created on the old system but did not exist on the new one (we created endpoints that call the old system's modules to create the relevant data used by Laravel).
- Some data had anomalies (inconsistencies), because some validations in the new system are stricter than in the old one.
In these cases we used a transaction to lock data during migration and rolled back when an error occurred. Fortunately, we did this in staging.
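To make the rollback idea concrete, here is a minimal sketch in plain Ruby. It assumes an in-memory table with a snapshot-based `transaction` method standing in for a real database transaction, and a made-up "email must be present" rule standing in for the stricter validations. `InMemoryTable`, `strict_validate!` and `migrate` are hypothetical names, not code from our migration scripts.

```ruby
class ValidationError < StandardError; end

# Stand-in for a database table with transaction support (assumption:
# a real migration would use the database's own transactions).
class InMemoryTable
  attr_reader :rows

  def initialize
    @rows = []
  end

  # Runs the block; if it raises, restores the rows as they were before.
  def transaction
    snapshot = @rows.dup
    yield
  rescue StandardError
    @rows = snapshot
    raise
  end

  def insert(row)
    @rows << row
  end
end

# Hypothetical stricter rule of the new system: email is mandatory.
def strict_validate!(record)
  return unless record[:email].to_s.empty?

  raise ValidationError, "missing email for id #{record[:id]}"
end

# Copies all old records in one transaction: either all rows land in the
# new table or none do.
def migrate(old_records, new_table)
  new_table.transaction do
    old_records.each do |record|
      strict_validate!(record)
      new_table.insert(record)
    end
  end
  :migrated
rescue ValidationError => e
  warn "rolled back: #{e.message}"
  :rolled_back
end

table = InMemoryTable.new
dirty = [{ id: 1, email: "a@example.com" }, { id: 2, email: nil }]
puts migrate(dirty, table)   # the second record fails validation
puts table.rows.size         # nothing from the failed run is kept
```

The point of the sketch is the all-or-nothing behaviour: a single anomalous record leaves the new table exactly as it was, so the migration can be fixed and run again.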
We did mob programming at a co-working space to avoid distractions and stay focused (for about 5 days), and after that our core system was replaced by the new one.
We monitored and fixed bugs over the following days, before facing the more challenging part: migrating the remaining modules and shutting down PHP.
We learned several things from that last deploy:
- It caused downtime for all customers (3 hours for just 1 module), and we want to avoid this in the next iteration. Imagine the next iteration with the remaining modules: 12 * 3 hours of downtime is a bad scenario.
- There are validation gaps between old data and new data, and since the next modules write to new tables, we need more test scenarios to adapt and convert data that is missing validation.
- Some customers had already churned, and we did not hesitate to migrate these customers to the new features.
- It needs DevOps to run the migration process. We asked DevOps to run the migration and waited for the result, success or failure, and when it failed we had to ask DevOps again for the logs.
- It is a one-way street with no turning back if customers feel better on the old system. Customers must accept it, like it or not.
- We worked overtime on weekends. That wears down mental health, is physically exhausting, and can indeed lead to wrong decisions. We want work-life balance, so we try to avoid this approach.
Because of the points above, we now migrate selected customers rather than everyone at once (all companies). The steps:
- Coordinate with business development and the product owner to select and inform customers who want to migrate to the new features.
- Run regression tests with automated scripts written by a Software Development Engineer in Test (SDET) to match old data against new data. When we find anomalies for one customer, other customers can still be moved.
- Deploy the refactored code to production but activate it only for newly signed-up customers, to get direct feedback for improvement.
- Select one or several companies via the admin dashboard, so anyone can run a migration. It calls a webhook to Flock / Slack so progress can be tracked, and we also get the log when it fails. Some migrations run for a long time in a background processing job.
- Freeze all accounts of the selected company's members and force logout, so no data changes during migration and the data stays consistent.
- Do the migration in one transaction, so if something bad happens the old data is still safe. We run it in a job with zero retries. It is important to disable retries because retrying causes database locking.
https://gist.github.com/kusumandaru/8f1eb6259e49f7178a0ac3fa94e6921b
- Add feature toggles so that once a migration is done, every redirect and service switches to the new system. We use Redis to make the reads fast.
- Version the migrations so we can track the migration status of each company.
- A button to roll back when needed and switch back to the old system. So when a customer still misses the old system, or something in the new system does not work, we can accommodate them.
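The per-company flow above (freeze, migrate once without retry, flip the toggle, record the version, allow rollback) could look roughly like the following. This is a sketch under stated assumptions, not the code from the gist: a plain Hash stands in for Redis, a Set stands in for the account freeze, and `MigrationRegistry`, `migrate_company`, `CURRENT_VERSION` and the key format are hypothetical.

```ruby
require "set"

# Tracks which system each company is on. A Hash stands in for Redis here.
class MigrationRegistry
  CURRENT_VERSION = 2 # hypothetical migration version number

  def initialize(store = {})
    @store = store
  end

  # Records the migration version; this also acts as the feature toggle.
  def mark_migrated(company_id, version: CURRENT_VERSION)
    @store["migration:#{company_id}"] = version
  end

  # The "rollback button": routes the company back to the old system.
  def rollback(company_id)
    @store.delete("migration:#{company_id}")
  end

  # 0 means the company has never been migrated.
  def version(company_id)
    @store.fetch("migration:#{company_id}", 0)
  end

  def backend_for(company_id)
    version(company_id) > 0 ? :new_system : :old_system
  end
end

# Runs one migration attempt for one company. The block is the actual
# data copy, which is expected to run inside a single transaction.
def migrate_company(registry, company_id, frozen_accounts)
  frozen_accounts << company_id        # freeze: no writes during migration
  yield
  registry.mark_migrated(company_id)   # flip the toggle only on success
  :ok
rescue StandardError => e
  # Zero retries: report the failure and leave the company on the old system.
  warn "migration failed for company #{company_id}: #{e.message}"
  :failed
ensure
  frozen_accounts.delete(company_id)   # always unfreeze
end

registry = MigrationRegistry.new
frozen = Set.new

puts migrate_company(registry, 7, frozen) { :copied }
puts registry.backend_for(7)
registry.rollback(7)
puts registry.backend_for(7)
```

Keeping the toggle and the version in one record means a failed or rolled-back company simply keeps reading from the old system, while other companies move independently.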
Before the customer migrations, we made some improvements:
- Automatic deploys (continuous integration and continuous delivery), so we can deploy every time code is merged to production.
- Ship code to production inactive, behind a feature toggle (deploying code in dormant mode).
- Keep Scrum routines like the daily standup to catch problems early, and create pull requests so everyone understands the proposed change, catches edge-case defects and raises code quality.