Platform Thinking

What is the role of AI in your platform?

We hear much today about how AI is going to do our jobs better than us.

AI is powerful – it has access to a vast amount of information and can do things humans can’t.

It is really great at writing shell scripts, boilerplate in programs, figuring out how to use an API, summarizing public information on the Internet, etc.

But these tasks are not the differentiator.

The real value is going deep.

Information found only in books, which AI does not have access to.

Experience found only in people and encoded in platforms.

AI in a product can be useful for extracting information from YOUR data.

General AI tools are platforms in themselves, but probably not YOUR Platform, and probably not the creator of the primary value of YOUR Platform.

YOUR Platform is for extracting and leveraging the deep – the value people need.

The Platform Test

How do you know if you own YOUR Platform?

If a customer needs a new hardware interface or connector on a product, can you easily add that?

If a security vulnerability is found in a piece of software in your stack, can you fix it?

If a customer wants to use a new USB peripheral in a product, can you release a new SW version with that driver included?

Can you do all the above quickly, easily, and with confidence?

If so, you likely own YOUR Platform.

Here’s the thing – embedded systems today are general purpose, updatable, and expandable. That means people will do things with them that you never imagined.

A fundamental characteristic of a platform is that it will go places you never imagined.

Are you prepared for that?

Preparing for the future

Yesterday, we discussed a fundamental characteristic of a platform: it will go places you never imagined.

How do we prepare for this?

There are two ways:

  1. Try to predict where things will go in the future and build specific features into your platform now – just in case …
  2. Build the tooling, processes, and workflows so that you can easily and confidently add functionality when it is needed.

It is obvious which approach works better – if we try to predict the future (at least the specifics), we are more often wrong.

We can either try to build the future now, or prepare to meet it as it comes.

The line between these two is often difficult to discern.

What will you improve today?

Platforms are all about improvement – at the personal, team, and company levels.

One approach is to each day write down something you are going to improve and set aside a small block of time daily to work on it. Make this part of your personal process.

This does not have to be something big – clean your workspace, create a checklist for something you don’t enjoy doing, automate something, write some documentation to help others on the team, write a test for some troublesome code, refactor something, improve CI/CD, …

The internal improvements you do today, while not directly seen by your customers, will help you deliver something better tomorrow.

What will you improve in YOUR Platform today? Reply and I’ll compile a list and share it in a future post.

Do you own your deployment?

Saturday morning, I got a call from a customer – something was not working due to a bug we had deployed Friday (no, we don’t have very good tests) :shushing_face:.

The fix was easy, I tested it locally, and then tried to push it to a Git hosting service we are using, but the Git service was down.

Now what? Our Ansible deployment script pulled directly from git, built the program, and then deployed it.

While I could reverse engineer the build from the Ansible scripts and do it manually, that would have taken time and introduced the possibility of another error.

So I pushed the repo to my Gitea server, tweaked the repo line in the Ansible script, and deployed the update – not a big deal.
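
Keeping a second remote on a self-hosted Gitea instance is only a couple of Git commands. A minimal sketch – the remote name and URL below are examples:

```sh
# Add a secondary remote (a self-hosted Gitea instance here) so the repo
# is not tied to a single hosting service. Name and URL are examples.
git remote add backup git@gitea.example.com:myorg/myapp.git

# Mirror all branches and tags to the backup remote.
git push backup --all
git push backup --tags
```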

This brings up a question, though – we don’t usually think of deployment as critical infrastructure (not a big deal if it is not working) – until you need to fix something quickly in production.

What if the deployment was wrapped up in some CI/CD workflow that only worked in vendor X’s cloud service?

Maybe simple deployments are actually better – a shell script that lives in the project repo that you can run anywhere. This could still be called by a CI process for the normal workflow.
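
As a rough sketch of what that can look like – the host, service name, and build command below are placeholders (a Go build is used only as the example):

```sh
#!/bin/sh
# deploy.sh -- lives in the project repo, runs anywhere with ssh access.
set -e

HOST=prod.example.com   # placeholder deployment target
APP=myapp               # placeholder application name

# Build locally (substitute your project's build command).
go build -o $APP ./cmd/$APP

# Copy the binary to the target and restart the service.
scp $APP $HOST:/usr/local/bin/$APP
ssh $HOST "systemctl restart $APP"

echo "deployed $APP to $HOST"
```

The normal CI workflow just calls the script; when CI or the Git host is down, a person with ssh access can run the same script by hand.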

All computing systems have the potential to fail – it does not matter how big vendor X is – their stuff can still fail.

Networks occasionally have problems.

DNS can have issues.

Systems get hacked.

No matter how many layers of complexity we pile on top, these risks remain.

In networked computer systems, the simplest path to resiliency is the ability to QUICKLY rebuild systems, whether that is your workstation, laptop, server, or deployment system.
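
One sketch of what “quickly rebuild” can mean in practice is an idempotent bootstrap script kept in a repo. The package list, repo names, and URLs below are examples for a Debian-based machine:

```sh
#!/bin/sh
# bootstrap.sh -- rebuild a workstation or server from scratch (sketch).
set -e

# Install the tools the projects depend on (Debian/Ubuntu shown).
sudo apt-get update
sudo apt-get install -y git ansible golang

# Re-clone project repos if they are not already present.
for repo in firmware backend deploy; do
    [ -d "$repo" ] || git clone git@gitea.example.com:myorg/$repo.git
done
```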

When things go wrong …

What do we do?

Do we focus on who/what to blame?

Or do we figure out a path forward?

How are we going to prevent this problem in the future?

Not by shaming someone into paralysis, but by fixing the process.

By improving YOUR Platform.

The opportunity for the individual/organization who made the mistake to help fix the process is a graceful way out and preserves their dignity.

What is the difference between YOUR Platform and other platforms?

We all use other platforms – operating systems, cloud services, middle-ware, hardware modules, etc.

It is tempting when building a product to piggyback entirely on someone else’s platform (AWS, .NET, one of the hundreds of IoT platforms, etc.).

Conventional wisdom tells us you can’t host your own service, deploy your own updates, design your own hardware, implement reliable systems, etc.

But at a small/medium scale, none of these things are very hard.

If you do them, you can simplify and optimize for your needs.

YOUR Platform is partly the ability to leverage other platforms, but also to build your own – where you are in control of the critical integration points.

The cost of updating dependencies, or not

As developers, we are often lazy when it comes to updating dependencies.

A short-term productivity hack is to not update them.

Leave our Yocto build at an old version.

Never touch go.mod or package.json – everything is working and I can keep focusing on coding features.

Don’t update our tools – we don’t have time.

… until things break, there is a security problem in a dependency, or we need a feature in a new version of something, etc.

And then things grind to a halt.

As Khem recently shared, maintenance is costlier than development – so even though development is important today, maintenance is more important for tomorrow.

Part of YOUR Platform should be selecting technologies that can be updated regularly with little pain, and a process to do this.
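
For a Go project, that process can be as small as a script run on a regular schedule – a sketch (the same idea applies to package.json, Yocto layers, and the rest):

```sh
#!/bin/sh
# update-deps.sh -- routine dependency update for a Go project (sketch).
set -e

git checkout -b update-deps

# Pull in the latest versions of all module dependencies.
go get -u ./...
go mod tidy

# Run the tests before the change goes anywhere near production.
go test ./...

git commit -am "Update dependencies"
```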

It is a question of paying a little continuously or a lot all at once later – and the latter is often so painful that it becomes impractical.

Investing in YOUR Platform compounds positive gains – accumulating technical debt compounds negative ones.

Platforms are for building systems

If you are building a one-off, non-connected device, you can get by without a platform.

This is why so many design shops don’t get platforms – they design something and then move on to the next project.

But if you are building a connected system, you have a much bigger problem to solve.

You now have a distributed system, and distributed systems are hard.

You are now living on the Internet with all its associated security concerns.

You have a system that has the potential to do so much more than it is doing now.

You have a system that has almost unlimited potential to be expanded.

You (potentially) have a platform.

Isn’t it risky to update your dependencies?

This is a common objection I hear when building industrial systems: “We want to lock things down to a super stable/tested LTS (Long Term Support) release and then stay on that release for a long time – it’s risky to update dependencies.”

Is it?

How often do you update your browser?

Your phone?

How often does Windows or MacOS force you to update your computer OS?

Do you worry every time it updates?

I’ve run Arch Linux for years and update routinely without worrying.

I update to new versions of Gitea as they come out without a concern.

I routinely update to the latest HEAD of Zephyr on projects during development and have rarely had a problem.

The same with about every software component I use.

Yes, there are safety-critical control systems that have stringent testing requirements, but we’re talking about complex connected systems that are mainly concerned with moving data around.

Where security is a concern.

With rare exceptions, modern OSS projects get more stable with each release, and to a lesser extent with each Git commit.

They have defied the laws of entropy.

How? With OSS workflows, testing, continuous integration (CI), more real-world usage, more user feedback and contributions, etc.

With good CI, changes don’t get merged to main until they are tested pretty well.

Transparency, community, and OSS workflows are powerful – really the only practical way to build complex technology.

The next time you seek the cozy cocoon of an LTS release for a dependency in YOUR Platform, think about what you might be giving up … features, improvements, community connection, and likely also stability.

Does consistency matter?

If you have a single developer on a single project, then perhaps consistency does not matter too much.

However, if you want to scale, either products or developers, then consistency matters.

Why?

So that code does not get drastically reformatted every time someone makes a change, making Git diffs impossible to review.

So that any developer can easily understand and make changes in any part of the codebase.

So that new products can leverage previous efforts.

So that new developers can be onboarded more easily.

So that we can see patterns and simplify systems.

So that our systems are tested.

So that documentation can be easily found.

Linus Torvalds is being lambasted for encouraging some consistency in Git commits.

But if you read his actual email, the request seems quite reasonable.

The Linux kernel has a well-defined coding style that all contributors are expected to follow.

Have we considered the impact this emphasis on consistency has had on the Linux Kernel’s success?

The irony of all this is that consistency is usually done in the name of the “team” or “reuse”. But if we reflect a bit, we are mostly just helping ourselves.

We can read and understand our own code in 6 months.

We can find stuff.

We can more easily make changes and improvements.

We have tools helping us.

A little bit of consistency goes a long way in building YOUR Platform.

How can we be more consistent?

What does not work very well is long standards and endless code reviews where we shame people into compliance. There are better ways.

We now have tools that can proactively format our code. We have linting tools that check things. Use them. Even if no one else reads our code, they help us.
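
For example – substitute the formatters and linters for whatever languages you use:

```sh
# Let tools enforce consistency instead of reviewers.
gofmt -w .                         # format all Go sources in place
go vet ./...                       # catch common Go mistakes
clang-format -i src/*.c src/*.h    # format C sources in place
```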

Write tests. Perhaps their greatest value is that they give us a new perspective on our work, which leads to consistency.

We can have CI hooks that check for various things – check out the Zephyr project if you want a good example of this.
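
A CI gate does not have to be elaborate. A sketch for a Go codebase, assuming the script is wired into whatever CI system you already use:

```sh
#!/bin/sh
# ci-check.sh -- fail the build if any Go file is not gofmt-clean.
unformatted=$(gofmt -l .)
if [ -n "$unformatted" ]; then
    echo "The following files need gofmt:"
    echo "$unformatted"
    exit 1
fi
```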

Being nudged by a CI tool to be consistent is a much better experience than being pulled over by the consistency police.

Does YOUR Platform have tooling that encourages consistency?

How do you partition systems?

In modern networked systems, there is often a debate about how to partition the system.

Consider one example with three levels: cloud, gateway, and IO (input/output) nodes.

All of these devices have a processor and can run software and do stuff.

So where do you do stuff? There are two basic philosophies:

  1. Push as much of the processing upstream to the cloud (where you have “infinite” resources) and never update downstream devices.
  2. Do as much at the Gateway and IO nodes as possible, make decisions when possible with local data, and only push processed data and events upstream. This requires routinely updating the downstream devices as needs change and better algorithms are developed.

Initially, approach #1 seems simpler and may be appropriate in some cases.

You can centralize logic and decisions as much as possible.

It is easy to update the system – you only have to update one central piece.

Until you need to scale, or there is a network disruption …

And you have 50 IO nodes per site sending frequent, unprocessed data all the way to the cloud, with decisions coming back down, even when nothing is changing.

And then you eventually have 1000 sites.

And then there is a cloud outage and the system is in limbo until the cloud issue is fixed. All data during the outage is lost.

And someone damages the CAN bus cable and an actuator is left in the ON state and a tank overflows because no one is making decisions.

And you are storing Gigabytes of unprocessed historical data that you never use.

Distributed systems are hard, but in the end they scale and are more resilient, especially when device locations force you to be distributed. Within reason, the more processing you push to the edge, the better. “Distributed” in these systems is not an architectural choice – it is forced on us by topology. Why not leverage this, since we need to deal with it anyway?
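
To make this concrete, here is a rough sketch of edge-local decision-making on a gateway. The `read-level` and `set-valve` commands are hypothetical stand-ins for whatever sensor/actuator interface your hardware provides, and MQTT is shown only as an example transport – the point is that the safety decision is local and only changes go upstream:

```sh
#!/bin/sh
# edge-monitor.sh -- sketch of local decisions at the edge.
# read-level and set-valve are hypothetical placeholders.
THRESHOLD=90
LAST=""

while true; do
    LEVEL=$(read-level tank1)       # hypothetical sensor read

    # Safety decision made locally -- no cloud round trip required.
    if [ "$LEVEL" -gt "$THRESHOLD" ]; then
        set-valve tank1 off         # hypothetical actuator command
    fi

    # Only push data upstream when the value actually changes.
    if [ "$LEVEL" != "$LAST" ]; then
        mosquitto_pub -h broker.example.com -t site1/tank1/level -m "$LEVEL"
        LAST="$LEVEL"
    fi

    sleep 10
done
```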

We can compare this to human systems. How well do autocratic, centralized organizations work? Where someone at the top makes all the decisions and the rest are minions?

They may work OK at a small scale, but even that is questionable.

A student of history soon observes that successful organizations push responsibility down. People at the top provide oversight but are only involved in local decision-making when coordination is required. Trusted people feed necessary information up as required.

Is it any different with hierarchical, distributed digital systems?

When edge nodes grow up

Yesterday we discussed how to partition functionality in hierarchical distributed systems.

In the past, it was difficult to do much at the edge nodes because they simply were not capable of much.

Resources were limited.

It was difficult to update software in them without being on-site and connecting a PC-based programmer – or, before that, replacing an OTP (one-time-programmable) chip.

But today things are different.

A powerful 32-bit MCU is now as cheap as your father’s 8-bit PIC and your grandfather’s 8051.

It has plenty of memory for network/communication stacks, and in-system update routines.

We now have powerful operating systems like Zephyr that we can run on these edge nodes.

If we continue our analogy to human systems, when a person is young or inexperienced, we don’t push a lot of responsibility on them.

However, when they grow up, we want them to do more, make decisions, and look around and do what needs to be done.

Modern MCUs are capable of doing a lot, including running AI algorithms – maybe it is time our edge nodes grow up.

Where do you put stuff?

There are a lot of ways to organize product development information.

You could do something like:

  • productA
    • firmware
    • hardware
    • yocto-build
    • docs
    • manufacturing
  • productB
    • firmware
    • hardware
    • yocto-build
    • docs
    • manufacturing

The above is nice in that it keeps all the stuff for each product together, but it discourages reuse between products.

If you want to build a platform, then it may make more sense to do something like this:

  • Firmware
    • Boards
      • productA
      • productB
    • Apps
    • Common
  • Hardware
    • Part Libs
    • productA
    • productB
  • Yocto
  • Manufacturing
    • Common testing framework
  • Product
    • productA
    • productB
  • Doc
    • (general technology information)
    • (general processes, etc)

Each type of code/design file lives in a common location where possible. The Product location is only for information that is truly unique to a specific product (documentation, top-level BOMs, etc.) and does not fit anywhere else. In many cases, each of the top-level categories is a separate repo. The Product repo may pull the other repos in as submodules, so you can version everything together.
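
A sketch of the submodule approach – repo names and URLs are examples:

```sh
# Inside the productA repo, pull in the shared repos as submodules so a
# product release pins a specific version of each.
git submodule add git@gitea.example.com:myorg/firmware.git firmware
git submodule add git@gitea.example.com:myorg/hardware.git hardware
git commit -m "Add firmware and hardware as submodules"

# Anyone building the product later clones everything in one step:
git clone --recurse-submodules git@gitea.example.com:myorg/productA.git
```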

If you want to build a platform, it helps to be intentional about this. Reuse and cross-product learning won’t just happen.

What happens when someone else tries to use your work?

It does not build on machine X …

It does not work in scenario Y …

It is hard to understand by developer Z …

As developers, we naturally avoid the rough edges of whatever we are working on – it hurts.

When other developers try to build/run our stuff, it is amazing what they find.

The more people you have building and running your stuff, the better it will be.

Even two people vs. one makes a huge difference.

The platform approach says anyone should be able to easily check out, build, and run any code on any machine.
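
In practice, that means the quick-start for any repo should be about this short – the repo URL and script names below are examples:

```sh
git clone git@gitea.example.com:myorg/myapp.git
cd myapp
./setup.sh   # install or verify required tools
./build.sh   # build everything
./test.sh    # run the tests
```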

We’ve run this experiment with OSS projects for a long time – it works.

Encourage others to build and run your code – it will make YOUR Platform better.

How do you keep a product maintainable?

How do you know if you have the ability to maintain your product?

With complex systems, having all the design files and source code does not always mean a lot.

Do you know how to build the source code?

Do you know why the hardware is designed the way it is?

Do you know how to test and verify changes?

Do you regularly use these tools?

The only way to know for sure if you can efficiently make a change to your product is to practice doing so.

If you don’t, things soon ossify to the point where the cost and risk of any change are prohibitive.

How to ask for help

Modern systems are too complex for any of us to know everything about them.

Many of the problems we encounter we have never seen before.

Communities built around various technologies are what fill this gap.

So the superpower in this age is first having the humility to ask for help, and then knowing who and how to ask for help.

Before asking anything, do a reasonable amount of work.

Someone once told me he puts thirty minutes into a problem before asking a co-worker for help, which seems like a good baseline. Any less than that, and you are likely interrupting and distracting someone from their work unnecessarily. Any more, and you may just be spinning your wheels. The nature of the problem may justify more time at times.

Sometimes our questions are met with silence.

Don’t just remind people that you need help.

Do more work, and bring more information, and then ask again.

If you ask more questions, always bring more information.

Help people help you.

This is an awesome post. I encounter this a lot on open source mailing lists, where folks are not asking questions clearly and not setting context with what they have tried.

Such emails are often ignored.

Doing more with less, rather than less with more

Building your platform does not mean you need to invent a bunch of stuff from scratch, or purchase expensive tools.

Rather, it is making better use of what you already have.

Automating your current workflows and deployment.

Reusing the designs and IP you already have.

Increasing the rate of iteration.

Automated testing so that you can deploy with confidence.

Platforms are not rocket science, and you don’t need fancy tools to get started.

A checklist can do wonders.

A few shell scripts and a cron job can get you a long way.
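
For example, a nightly backup is one script plus one cron entry – the database name, paths, and host below are placeholders:

```sh
#!/bin/sh
# backup.sh -- dump the database and sync it offsite (sketch).
set -e
pg_dump myapp > /var/backups/myapp-$(date +%F).sql
rsync -a /var/backups/ backup.example.com:/srv/backups/myapp/

# crontab entry to run it every night at 2am:
#   0 2 * * * /usr/local/bin/backup.sh
```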

Automation can be written in the languages you are already using on the project.

Building testing and diagnostics into your products rather than needing expensive, external fixtures.

Leverage the CI features built into tools like Github and Gitea.

Platforms are the boring work that clears the path and gives you time and space to do the interesting work that adds value to your products.