Is it easy to document all major process changes in a company for 10 years? Not all companies survive that long as independent entities. I wrote two posts five and two years ago about the evolution of our process, so now it is time to cover the entire decade of 2008 to 2018. I will definitely focus on the last two years in greater detail, since earlier periods were covered in previous articles.
We are still developing Targetprocess, our agile project management software. One side product has failed. One was integrated into Targetprocess. The third is an open-source data visualization library, and we’re not going to commercialize it. This year, we started yet another product called Fibery, with the aim of creating a work management platform. It’s already an exciting project, but 90% of the company is dedicated to our main product — Targetprocess.
|Year||Company Size||Company structure||New Technologies|
|2008||15||One cross-functional Scrum development team||C#, ASP.NET, ASP.NET Ajax, NHibernate|
|2009||22||Two development teams:|
Core (5 Developers, Scrum Master, 3 Testers)
Integration (3 Developers, Team Lead, Tester)
|2011||30||Several mini-teams. We form a mini-team for every Epic. It is fully responsible for Epic implementation.||LINQ (own powerful data framework), jQuery, NServiceBus|
|2013||60||4 larger cross-functional teams||SignalR, Rx|
|2014||72||5 cross-functional teams||ReactJS, CasperJS|
BuildBoard: Scala, Angular.js
|2015||80||5 cross-functional teams, 2 of them fully cross-functional||Redis, Webpack, ElasticSearch, Kibana|
|2016||105||Open Allocation experiment||Node.js, TypeScript, Redux, RabbitMQ, Docker, Kubernetes, Helm, .NET Core|
|2017||110||2 units with 3 cross-functional teams inside each unit||Go, PostgreSQL, Prometheus, Kong API Gateway, Fluentd, Swagger|
Department / Team Structure
The last two years were really interesting in terms of company structure. We ran a major Open Allocation experiment through 2016, and made significant changes in 2017.
Emergency Team is back / Feb 2016
In 2016, we reintroduced the Emergency Team. It’s a two-week duty rotation between teams to fix bugs and implement small improvements. Back in 2014, we decided to focus on new functionality and reduce distraction, but this was a mistake. Quality decreased, and hundreds of small improvements accumulated in the backlog. The Emergency Team was an OK practice for handling small product improvements, so we brought it back.
Emergency Team -> Teams Competence / May 2016
Within a few months, we decided that the Emergency Team practice was not good enough. Why? It had two intrinsic problems:
- Some emergency problems take more than 2 weeks to solve, and we had no mechanism to handle them.
- The product is large. Sometimes, people on duty don’t have enough expertise to solve current important problems in a reasonable timeframe. They have to learn new areas, and this is inefficient.
We decided to solve these problems by replacing the duty rotation with Teams Competence. We split Targetprocess into areas such as Search, Lists, Reports, Filters, etc. Each development team was responsible for several areas. All requests about improvements in a specific area went to the responsible team, and the team decided when and how to implement them.
This is what we expected as a result of this practice:
- Bugs will be fixed and improvements implemented faster, since people with the most expertise will work on them.
- Development teams will increase their expertise in specific areas, which is a must-have step for microservices architecture.
- Larger improvements that take more than two weeks will be possible.
- The Product Owner will no longer be a bottleneck, since Teams will be able to make most decisions on their own.
This was introduced in May and everything looked good, but in July 2016 we stopped everything to run a major (and risky) experiment.
Open Allocation Experiment / July 2016
I was always fascinated by the idea of creating an organization without managers. I felt this is the right way to go, since managers don’t add any real value. They don’t write code, don’t invent things, don’t write articles, don’t design, and don’t sell. Why the hell would any company have them? So from the start, we didn’t have dedicated managers. The founders set strategy and fired people, but that was it. Teams did the hiring, we had a peer review process to check performance every 6 months, etc. In 2016, we made one more step towards becoming a completely self-organizing company, and giving people the freedom to form teams and choose what to work on.
We shut down the Product Board and Teams Competence. Instead, the Product Owner set a very high-level theme for the year, like “Focus on pains and misfits in the product”. People had the freedom to start their own Initiatives. A person who starts an Initiative is a Source. An Initiative is a project that generally takes from 2 weeks to 6 months, with a defined team and a strict deadline (set by the team itself). People were able to join Initiatives or leave them (which never happened) at any time.
This change was inspired by a single thesis: People feel more responsibility and passion when they work on their own initiatives.
We could have enforced responsibility from the top-down—set deadlines and pushed people to the limit to meet them. That’s what most companies do. But this strategy is not aligned with our values.
10 months passed and we evaluated the results of the experiment. What did we learn?
- People were happy with the freedom of choice, and most deadlines were met.
- Most people participated in Initiatives, so it was hard to find developers to fix small annoying things.
- The high-level theme was focused on UX and functionality, but missed technical debt. In 10 years, we’ve accumulated huge piles of technical problems, and they went mostly unaddressed.
The third point is huge. It seriously affected development speed. You need a complete technical vision to move the product in the right direction, but we didn’t have it. I think this is the top problem that forced us to stop the Open Allocation experiment and give Teams Competence another shot, but on a new level.
Units & Team Competence / May 2017
Targetprocess should become a Platform. This vision formed in March 2017. Platformization means a new technical and product vision, so we’ve split the development force into two Units:
- Core & Infrastructure Unit is responsible for the technical platform: Microservices architecture, major platform services and UI libraries.
- UX & Solutions Unit is responsible for functional solutions based on a new technical platform and various functional improvements.
We’ve formed stable teams inside these Units. It turned out that the Open Allocation practice tore some teams apart, and stable teams work much better. Teams agreed on areas of responsibility, and started to work. Unit Leaders became quite similar to formal managers.
The main propositions we used to build this new structure:
- We are trying to build a flat company. Our goal is to have as few managers as possible. A good balance looks like 1 manager to 20 people.
- We don’t want to have managers that just “manage” people. Every person in a management position should do a real job: provide support, write code, play the feature owner role, write articles, sell, etc.
- We don’t want to have a separate HR department for things like hiring, firing, salary review and conflict resolution. Unit Leaders will handle that.
- We don’t want to have managers inside teams.
The outcome of this change should be improved focus and specialization, faster decisions about immediate problems from customers, and better process improvement practices (they deteriorated with Open Allocation). We’ve been working with this approach for five months already, and so far so good. However, it takes about a year to reflect on such serious changes.
Below is the evolution of our development structure over the past 10 years:
Now let’s dig into details.
|Year||Iterations||Release planning||Features prioritization||User Stories estimation|
|2008||Weekly||Formal release planning meeting. Release has ~2 months duration||Ad-hoc, by Product Owner||Planning poker. Estimate user stories in points|
|2009||None, we switched to Kanban||We do not plan releases||Don't estimate|
|2012||We create a long-term roadmap that shows the future product vision on a very high level (Epics). The roadmap is updated every 3 months.|
|2013||Ad-hoc, by Product Board||Quickly estimate in points without a formal approach|
|2014||Some development teams tried to use iterations again||By Product Board using a formal model|
|2015||Back to Kanban||Ad-hoc, by Product Board|
|2016||None, people start Initiatives when they want.||Freedom. People propose initiatives, PO approves (all)||Publicly committed deadlines for initiatives|
|2017||Still none||~3 months horizon||Strategic Board sets themes and major features, Teams decide about improvements and stories||No estimates|
Estimates -> No Estimates -> Estimates -> Deadlines -> No Estimates -> ?
Estimation is not a goal by itself; it is a mechanism to forecast when a feature will be completed. Unfortunately, this is really hard to achieve. Our usual error rate varies from 1.5x to 2x. Two of the hardest things in the software development process are estimating user stories and splitting them. 10 years of experience didn’t help. Deadlines set by teams did. Open Allocation and publicly committed deadlines helped us forecast feature completion with precision. Then we stopped this practice and did not bring estimation back. As a result, we are having trouble with accurate forecasts again.
Deadlines are good when people select their own work, but they are not good when you set them from the top. You have to timebox features, but we haven’t found the right way to do that yet.
The history of changes to our prioritization mechanism is quite interesting:
Product Owner -> Product Board -> Formal Model -> Product Board -> Freedom -> Strategic Board.
The Product Owner and Product Board used to prioritize all features, even small ones. Open Allocation mode replaced prioritization with Initiatives proposed by anyone in the development department. I’d say it worked 50/50. Some initiatives were great, but some were not that important.
How can you know whether some particular feature will add great value to the product? It is difficult. Everybody makes mistakes here: battle-scarred product owners, CEOs, committees, and formal models. There is no good way to validate the decision before implementation. Quite often I think that random choice is no worse than expert opinion, but I don’t have enough courage to run this experiment.
When Open Allocation was replaced by Team Competence, we were thinking about resurrecting the Product Board. On the other hand, we wanted to preserve some freedom of choice for development teams. The Strategic Board is our current solution. It does not prioritize all features. Rather, it sets major themes and selects a few top features that should be addressed in the next 6+ months. Teams are free to select them from the list, or add their own smaller features and improvements to work on. It seems like a good balance between aligned vision execution and freedom of tactical maneuvers.
Feature Owner role
Until 2017, we had Feature Owners (FO), but it was not a full time position—just a side role for developers, designers or QA engineers. It became clear that this role demands specific knowledge and skill sets as a full time activity. You can think about the Feature Owner as a Product Owner for a team. This year, the FO is a separate position in every development team. We didn’t hire external people, but let existing people transfer into the FO position.
|Year||Tracking and reporting tools||Time tracking||WIP Limits|
|2008||Task Board, Iteration burn down, release burn down||We track spent and remaining time on tasks and user stories||Time-boxing|
|2009||Kanban Board, Cycle time||We have a limit of 3 user stories or bugs in progress.|
|2011||Kanban Board, Cycle time, Builds board||We don't track time||Flexible limits, not defined clearly|
|2012||Kanban Board, Team Board, Cycle time, Builds board, live roadmap on the wall||2 user stories in WIP per developer (1 is recommended). In general, a mini-team decides for itself.|
|2014||Interactive roadmap, Kanban Board, live roadmap, various reports||No clear WIP definition, people decide how to work|
|2017||Company portfolio Timeline||Still no WIP limits. We can do better here.|
Net Promoter Score became the most important metric in our company. Maybe it is not perfect, but we haven't found a better metric for customer satisfaction yet. We started actively tracking it in 2016. I believe most companies hide it, so it is hard to say whether it’s bad or bearable. "Experts" say that you should have +20% NPS to have a successful product. We have -20% NPS at the moment:
Our long term goal is to raise NPS to positive numbers (at least). It is easy to say, but hard to master. During the last year, we’ve only managed to raise it from -22% to -20% (and most likely this is just a measurement error).
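For readers unfamiliar with the metric, here is a minimal sketch of the standard NPS calculation: the percentage of promoters (ratings of 9 or 10) minus the percentage of detractors (ratings of 0 to 6) on a 0-10 survey. The function name and the sample ratings are hypothetical and are not our real survey data.

```python
def net_promoter_score(ratings):
    """Return NPS (from -100 to 100) for a list of 0-10 survey ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)   # ratings of 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)  # ratings of 0 to 6
    # Passives (7 or 8) count only towards the total number of responses.
    return 100.0 * (promoters - detractors) / len(ratings)


# Hypothetical sample: 3 promoters, 2 passives, 5 detractors.
sample = [10, 9, 9, 8, 7, 6, 5, 3, 2, 0]
print(net_promoter_score(sample))
```

With a small number of responses, a single answer moves the score by several points, which is one reason a change from -22% to -20% can easily be noise.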
One more metric we’ve added to track product simplification is Support Load. It shows how many requests (issues) and support chat sessions we have per 100 MAU (monthly active users). The general idea is that this metric should be lower for a simpler product. If it drops, it means we are doing well with our goals of simplification and unification. It looks like this is the case for 2017.
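As a rough illustration of the definition above, Support Load can be computed like this. The function name and the input numbers are made up for the example; they are not our actual figures.

```python
def support_load(requests, chat_sessions, monthly_active_users):
    """Support requests plus chat sessions per 100 monthly active users."""
    if monthly_active_users <= 0:
        raise ValueError("monthly_active_users must be positive")
    return 100.0 * (requests + chat_sessions) / monthly_active_users


# Hypothetical month: 120 requests, 80 chat sessions, 4,000 MAU.
print(support_load(120, 80, 4000))
```

Normalizing by MAU is what makes the number comparable between months: raw request counts grow with the user base even if the product gets simpler.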
Other metrics are not so important for us now, but still it is interesting to check Cycle Time changes over 7 years. Increase in Cycle Time is mainly caused by less frequent builds. We used to release about 30 builds per year, now we are doing about 20, and it is close to one build every 20 days. To reduce Cycle Time, we have to break up the monolith into services and release them independently. We already have some of them ready, and a user story inside a service can move from start to production in several hours. We expect Cycle Time will drop dramatically in 2018.
The Done User Stories chart is nice. We had a huge decline of throughput in Q1-Q2 2016, but Open Allocation boosted team performance. Interestingly enough, when we moved to Teams Competence, performance declined, but then recovered.
|Year||Retrospectives||Daily meetings||User Stories split|
|2008||Every 2 weeks||Yes, at 10 a.m., 10 minutes on the average||User stories should fit in 1 week iterations.|
|2009||Yes, at 11 a.m., 15 minutes on average||We split stories, but sometimes not as aggressively as required.|
|2011||We run Just In Time meetings instead of periodic meetings. We have an Issue Board, limited to 3 items. When the limit is hit, we run a meeting.||It is still our weak side. Sometimes user stories are too big.|
|2012||We have stop-the-line meetings with people related to the issue. They are quite rare now.||Some improvements reached, but it's still a very problematic practice.|
|2013||No retrospectives||Yes, at 11 a.m., 7 minutes on average|
|2014||Regular, team level, teams decide|
|2015||Company wide retrospective||Yes, at 11 a.m., 9 minutes on average||Still hard|
|2016||Irregular team retrospectives|
|2017||Team level, ~1 per month.||One unit kept them as optional, another unit removed them||Don’t ask|
Open Allocation almost ended our retrospective practice. New teams were focused on real work instead. Teams were somewhat fluid, so why bother to reflect when in a month the team may disappear? This led to a serious problem (in fact, one of the top problems in our development process): how to set cross-team practices, and make sure that teams really follow them. More on that later.
This year, we brought team level retrospectives back. New teams in general think that retrospectives are helpful; they help to spot many problems and find solutions. The opinion of more mature teams is less positive, since they somehow rarely get interesting ideas from this practice. I think the general rule is bi-weekly retrospectives for fresh teams, monthly retrospectives for more mature teams, and quarterly retrospectives for jelled teams.
After we split the whole development force into two Units, each Unit set its own process. The Core Unit kept a Unit-wide daily meeting as an optional one (though quite often it is fully packed), while the UX&S Unit decided to halt this practice. We do, however, have daily meetings on a team level. The Core Unit demands more coordination, since it is working on a new platform and rebuilding many things from scratch.
In my opinion, jelled teams don’t need daily meetings. They are very good for fresh teams and problematic teams, but highly experienced teams communicate all the time and thus easily maintain a unified context.
User Stories split
We’ve been struggling with this for 10 years. You know what? We haven’t won. The situation is very similar to trench warfare during World War I. We’ve achieved some bearable level, but can’t make it a solved problem. Quite regularly, stories are still too large. Is it inevitable for any complex product? Is it an NP-hard problem of software development process? Perhaps both.
Reflect & Adapt / Process Coach role
I briefly mentioned that one of our hardest problems is ensuring that teams follow agreed practices. For example, we all agreed that retrospectives should be regular and at least monthly. In reality, some teams don’t follow this. They think that retrospectives bring no value, and are reluctant to run them. Well, maybe this is OK, since the team knows better. What is not OK is the failure of a team to follow its own practices. Let’s say the team decided to give pair programming a try, but then somehow didn’t. Who should force the team to follow the practices it approved? Remember, we don’t have managers inside teams, and we don’t have Scrum Masters (I personally don’t like this role, since in most cases it is just a replacement for a manager. Somehow, Scrum Masters tend to evolve into managers. God knows why).
We decided to add a Process Coach role that helps teams reflect, improve, and follow the agreed process. The team is free to select any process, but it has to reflect & improve. One Process Coach can help 1-4 teams.
This role was introduced in June, but even now we have zero Process Coaches. As you see, even as a company we are not always great at following our processes. I believe this problem is not unique to our company. Discipline and responsibility regarding processes are a common problem in the IT industry. A management hierarchy can partially solve it, but is there a better way? We are trying to find it.
|Year||Local / Team level||Global / Cross-team level|
|2008||Release planning (team)|
Iteration planning (team)
Iteration demo (team)
|2009||User Story kick-start (3-4 people)|
User Story demo (4+ people)
|2011||User Story kick-start|
User Story demo
|2013||User Story kick-start||Product board (weekly)|
Development board (weekly)
|2014||User Story kick-start|
Retrospectives (every 2-3 months)
|Product board (monthly)|
Retrospectives (1-2 per year)
Development board (monthly)
Feature demos (monthly)
|2015||User Story kick-start|
Retrospectives (every 2-3 months)
Product Specialist + Development Team (weekly)
Development board (weekly)
|2016||User Story kick-start||Feature demos|
|2017||User Story kick-start|
Open allocation nearly eliminated team-level meetings, as well as most company-wide meetings. Feature demos became more important, since it was at this event that teams demonstrated the results of an Initiative (the meeting date served as a deadline for the Initiative).
Earlier this year, we decided to replace many of the smaller feature demo meetings with a large 4-hour company-wide Quarter Results meeting. All departments demonstrated their achievements. It turned out that these events are exhausting. The flow of information is unbearable, you can’t dig into details, and everything is rushed. Most likely, we will stop Quarter Results meetings and get back to feature demos.
UX and Craftsmanship
|2011||Sketches (many ideas)|
live usability tests on prototypes
|Salary depends on learning|
A paid major conference visit for every developer
5 hours per week on personal education
|2012||Sketches (many ideas)|
Live usability tests on a real system
|Salary depends on learning|
A paid major conference visit for every developer
5 hours per week on personal education or projects
|2013||Cross-functional UX teams|
Sketches (many ideas)
8 hours per week to personal projects (Orange Friday)
|2014||Cross-functional UX teams|
Sketches (many ideas)
Huge gap between UX and Development.
|Focus on getting things done|
|2015||Cross-functional UX teams|
Sketches (many ideas)
|20% of time on own projects (team orange months)|
Developers’ backlog to handle the technical debt
|2016||Anarchy (or Freedom?)||Arch Board|
Documented UX process
Core & Infrastructure Unit
Formal UX processes were removed with the addition of Open Allocation. Each team had a UX Designer and Initiative Source. This pair was responsible for the final functional solution, and they were free to do it as they pleased. I’d say it worked pretty well from a UX point of view: all solutions were at least OK and useful.
This year we restarted UX process formalization, revisited it, and documented it. The process has several must-have steps, while everything else is just a recommendation. One interesting mandatory step is Success Metrics definition.
Imagine we discover some problem and want to solve it. How do we know whether the problem is solved or not? For example, we had many complaints about Search functionality. The solution was to implement a new Search module from scratch. When to stop? You should have something measurable to define success. It can be something like this:
- Number of search queries should increase by 100% in the three months after release
- Amount of negative feedback with search mentioned should decrease by 50% in the three months after release
Success Metrics force teams to think deeper about the problem and solution. They provide a mechanism to evaluate success and answer the question: "Did we solve the problem?"
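The Search example can be written down as a small check that answers "did we hit the target?" once the three months are over. This is only a sketch: the `SuccessMetric` class, the baseline values, and the observed values are all hypothetical, and only the +100% and -50% targets come from the example above.

```python
from dataclasses import dataclass


@dataclass
class SuccessMetric:
    name: str
    baseline: float       # value measured before the release
    target_change: float  # +1.0 means +100%, -0.5 means -50%

    def is_met(self, observed):
        """Compare the value observed after the release with the target."""
        target = self.baseline * (1 + self.target_change)
        if self.target_change >= 0:
            return observed >= target  # metric is expected to grow
        return observed <= target      # metric is expected to shrink


# Hypothetical baselines; the targets mirror the Search example.
queries = SuccessMetric("search queries per month", 1000, +1.0)
complaints = SuccessMetric("negative feedback mentioning search", 40, -0.5)

print(queries.is_met(2100))    # 2100 >= 2000
print(complaints.is_met(25))   # 25 is still above the target of 20
```

Writing the baseline down before the release is the important part: without it, there is nothing to compare against three months later.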
We ran an experimental Design Sprint. The goal was to find solutions to navigation problems in Targetprocess. Nothing spectacular happened in the end, but it was a really great experience as an inspirational team building exercise. Design Sprints can work, so we might use this practice from time to time to solve large problems.
It is really hard to maintain a high level of craftsmanship inside an organization. You have constant pressure from the market. You have to follow major trends, add features quickly, and respond to customer requests. The usual mode for most software development companies these days? Rush! Speed is everything. It takes an enormous amount of courage and will to slow down and think deeper. Usually it just doesn’t happen. Some Friday, driving back home, you’re hit by the thought that you’ve spent a week implementing an uncomplicated user story. This is it. Technical debt got you. We fought it for 10 years, and nearly lost.
I won’t re-iterate all the practices we tried. Below are the learned lessons for a company beyond startup mode:
- Don’t rush.
- If you have to rush – don’t.
- If you HAVE TO rush for the business, dedicate 1.5x the time to fixing technical debt later. For example, if you’ve implemented a feature quick-n-dirty in 2 months, dedicate 3 months to a re-write later (if it survives).
- Note: If you have a significant time frame allocated to technical debt, it will be replaced by some new feature. So — don’t rush.
- Document ALL controversial solutions. It will be a source of incredible insights later, and you will learn faster. At least you will learn.
- Listen to the most experienced developers. They are usually right.
- Overtime doesn’t work in the long run.
- Aim for a small Core with extensive Solutions on top of the Core as independent apps/services/whatever. Maintain the highest possible quality for code and architecture of the Core; don’t worry about Solutions much.
Targetprocess architecture was buried by accumulated piles of technical debt. We should have addressed it around 2012, but pursued more functionality instead. In retrospect, this was a mistake. What now? We are rewriting major parts of our application from scratch, using a service-oriented approach.
Here are our company’s focus changes by year:
Development practices & Technology
|Source control and branching||Git. We are trying a single-branch development again to enable continuous delivery. It is impossible to do that with feature-branches||Back to feature branches. Gitflow with a mandatory Code Review||Each microservice has its own repository and branch structure. Code review is still mandatory|
|Pair programming||Pair programming is completely optional. The mini-team decides for itself||We use pair programming less and less. This practice is fading away||PP completely died|
|TDD/BDD||Clear focus on BDD. We've customized NBehave for our needs and implemented a VS add-in to write BDD scenarios||Stopped BDD. We discovered that BDD tests are hard to navigate and maintain. Even custom developed tools didn’t help||Stopped TDD. Now we write unit tests after the code is created|
|Automated Functional Tests||We are migrating to Selenium WebDriver||Custom JS test framework||CasperJS tests||CasperJS tests are deprecated. Use Custom JS test framework||C# + WebDriver, Node.js + Chrome Headless + Puppeteer|
|Continuous Integration||Still use Jenkins. The goal is to eventually have a Continuous Delivery process||Continuous delivery is a mirage…||We created BuildBoard.|
Deployment is fully automated
|We use staging servers and roll out new builds gradually with instant roll-back if required||We use GitLab pipelines that create Docker-images and generate Helm Charts|
|Feature toggling||We apply feature toggling and are heading towards a continuous delivery process||Per account feature toggling. Fast feature disabling||Feature toggling is used extensively|
|Code duplication tracking||We started to track and fix code duplication||This practice wore off|
|#Public Builds per Year||26||24||34||27||25||18|
Microservices & Platform
Our company never had a documented technical vision. We hired experienced developers and expected them to do their best. They did, but it turned out that no shared vision emerged, and the different technical solutions were not aligned at all. We tried to set up an Architecture Board in 2016 that controlled all technical solutions. It kinda worked, but most likely it was too late.
In March 2017, we nailed down a new vision for Targetprocess — Platformization. It became obvious that major technical changes are required to move the product forward. An R&D team was assembled from top developers, and in two months we produced a quite detailed technical vision of the Targetprocess Platform.
A microservice approach makes up the core of this technical vision. We are aware of problems with microservices, but it still looks like the way to go in our context. At the end of 2017, we are starting to see some positive outcomes, such as speedy deployment time for new services. However, there are many problems with infrastructure, and infrastructure complexity is high. Our main goals are:
- Accelerate development and deployment
- Enable custom development for large customers based on a platform, while keeping the main product intact
- Reduce infrastructure cost, and migrate to Linux servers
In 2018, we will evaluate the results.
The split into Units led to fewer builds. The Core Unit does not contribute to functional improvements and fixes, so only half of the developers are working on releasable things at the moment. With the microservices approach, this metric will be irrelevant, since every service will have its own release cycle.
GitHub -> GitLab
The Core Unit migrated to GitLab. GitLab has a powerful set of CI/CD services (GitLab pipelines, Docker registry). It helps us automate many testing and deployment tasks. GitLab is less stable than GitHub, and slower, but things are improving.
We use GitLab pipelines to create Docker images and Helm charts, and to run unit tests and service tests. Helm charts are published to our GitHub repo. When developers want to deploy a service to a staging or production cluster, they install Helm charts to Kubernetes using our custom deployment service.
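To make the last step more concrete, here is a sketch of what installing a published chart boils down to. It is not our deployment service, whose internals are not described here; it only assembles a standard `helm upgrade --install` command. The function name, release name, chart reference, version, namespace, and kube context are all hypothetical.

```python
import shlex


def helm_deploy_command(release, chart, version, namespace, kube_context):
    """Build a `helm upgrade --install` command for one service."""
    return [
        "helm", "upgrade", "--install", release, chart,
        "--version", version,            # chart version to install
        "--namespace", namespace,        # target Kubernetes namespace
        "--kube-context", kube_context,  # selects the target cluster
    ]


# Hypothetical values; print the command instead of executing it.
cmd = helm_deploy_command(
    "search", "example-repo/search", "1.2.3", "staging", "staging-cluster"
)
print(shlex.join(cmd))
```

`upgrade --install` covers both the first deployment and later updates with the same command, which keeps a deploy step idempotent. Running it for real would mean passing `cmd` to something like `subprocess.run`.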
|Number of repos||Number of clusters||Typical production cluster||Typical test cluster in AWS|
|119 repos on GitLab|
219 repos on GitHub
|8 test clusters|
13 production clusters
|110 pods overall|
15 production services
25 infrastructure services
|4 large EC2, 3 medium EC2|
3 of them are Windows, 4 are Linux machines
Some real numbers to help demonstrate the size and complexity of the product.
|Virtual machines to run tests||38||62||52 virtual machines for Targetprocess Core CI.|
15 AWS EC2 instances for new microservice CI
|Short circle build time (without package)||40 min||50 min||60 mins|
|Full circle build time (with package)||1 hour||2 hours||1.5 hours|
|Git commits||8,400||14,000 (in 2014)||15,000 (in 2016)|
Client: 145,000 (JS)
Server: 296,000 (C#)
Client: 192,000 (JS)
Server: 200,000 (C#)
One interesting observation is that there is less server-side code in the main app now. Some parts of the system were already migrated to separate services, and some old parts were just removed.
Test infrastructure is migrating to AWS slowly. In the long run, we want to move the whole CI process to the cloud.
“If you obey all the rules, you miss all the fun” — Katharine Hepburn