<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Mathoholic Systems]]></title><description><![CDATA[Engineering notes, system designs, backend architecture, AI/ML pipelines, and real build logs from my journey.]]></description><link>https://mathoholic.dev</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1763296964633/96041b75-b9fc-40f6-a82f-9dca9abc407c.png</url><title>Mathoholic Systems</title><link>https://mathoholic.dev</link></image><generator>RSS for Node</generator><lastBuildDate>Mon, 20 Apr 2026 08:29:13 GMT</lastBuildDate><atom:link href="https://mathoholic.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Your Dockerfile Is the Problem]]></title><description><![CDATA[Why Some Builds Are Fast and Others Are Painfully Slow
When people complain “Docker builds are slow,” what they really mean is one thing:

They didn’t design their Dockerfile properly.

Most articles, blog posts, and AI chat responses try to “fix” sl...]]></description><link>https://mathoholic.dev/your-dockerfile-is-the-problem</link><guid isPermaLink="true">https://mathoholic.dev/your-dockerfile-is-the-problem</guid><category><![CDATA[Docker]]></category><category><![CDATA[docker images]]></category><category><![CDATA[Dockerfile]]></category><category><![CDATA[#dockerbuild]]></category><category><![CDATA[Devops]]></category><category><![CDATA[development]]></category><category><![CDATA[deployment]]></category><dc:creator><![CDATA[Shantanu Sharma]]></dc:creator><pubDate>Sat, 20 Dec 2025 19:34:31 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1766259221355/9ccd6cc0-1402-431c-a165-18d8ccb180e1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-why-some-builds-are-fast-and-others-are-painfully-slow">Why Some Builds Are Fast and Others Are Painfully Slow</h2>
<p>When people complain “Docker builds are slow,” what they really mean is one thing:</p>
<blockquote>
<p><strong>They didn’t design their Dockerfile properly.</strong></p>
</blockquote>
<p>Most articles, blog posts, and AI chat responses try to “fix” slow builds by:</p>
<ul>
<li><p>disabling cache</p>
</li>
<li><p>increasing CI timeouts</p>
</li>
<li><p>using bigger runners</p>
</li>
<li><p>blaming Docker itself</p>
</li>
</ul>
<p>None of these actually fix the underlying issue.</p>
<p>Docker builds are predictable. Just look at the Dockerfile.<br />If your builds are slow or unpredictable, it’s because of <strong>how your Dockerfile is written</strong>.</p>
<p>This article explains what’s really going on, how Docker caching works in real life, and exactly what you must do to fix slow builds once and for all.</p>
<h2 id="heading-docker-builds-arent-magic-theyre-predictable">Docker Builds Aren’t Magic: They’re Predictable</h2>
<p>When you run:</p>
<pre><code class="lang-plaintext">docker build .
</code></pre>
<p>Docker does <strong>three things</strong>:</p>
<ol>
<li><p>Uploads your build context to the daemon (or remote builder)</p>
</li>
<li><p>Walks through your Dockerfile, step by step</p>
</li>
<li><p>For each instruction, checks if it can reuse a cached result</p>
</li>
</ol>
<p>This is not a compiler with AI. It’s a <strong>filesystem snapshot engine</strong>.<br />Each instruction creates a <strong>layer</strong>: a snapshot of the filesystem at that point.</p>
<p>These layers are <strong>immutable</strong>. Once a layer is created, it never changes. When you build again, Docker doesn’t “re-run” everything. Instead, it compares each instruction and its inputs with a previously built layer:</p>
<ul>
<li><p>If Docker can prove that <strong>the instruction and its inputs didn’t change</strong>, it will reuse the layer from cache.</p>
</li>
<li><p>If anything changed, even something seemingly irrelevant, the cache breaks and Docker re-runs that step and all subsequent ones.</p>
</li>
</ul>
<p>That’s the whole model.</p>
<hr />
<h2 id="heading-the-real-causes-of-slow-docker-builds-and-how-to-fix-them">The Real Causes of Slow Docker Builds (and How to Fix Them)</h2>
<p>Let’s unpack the real reasons builds slow down and how you fix them.</p>
<h2 id="heading-1-bad-layer-ordering-nukes-cache">1. Bad Layer Ordering Nukes Cache</h2>
<p>If you copy everything before installing dependencies, you’ve guaranteed rebuilds on every change.</p>
<h3 id="heading-bad-pattern">Bad Pattern</h3>
<pre><code class="lang-plaintext">FROM node:latest

WORKDIR /app

COPY . .

RUN npm install
RUN npm run build
CMD ["npm", "start"]
</code></pre>
<p><strong>What’s wrong here?</strong></p>
<ul>
<li><p><code>node:latest</code> is non-deterministic: you don’t know what you’ll be building tomorrow (if a newer image is published, the base layer changes, the cache misses, and everything after it rebuilds)</p>
</li>
<li><p><code>COPY . .</code> invalidates cache for <em>everything</em> whenever <em>any</em> file changes</p>
</li>
<li><p>So every tiny source tweak reruns <code>npm install</code></p>
</li>
</ul>
<h3 id="heading-fix-copy-only-what-matters-in-the-order-that-matters">Fix: Copy only what matters, in the order that matters</h3>
<pre><code class="lang-plaintext">FROM node:20-alpine AS builder

WORKDIR /app

COPY package.json package-lock.json ./
RUN npm ci

COPY . .
RUN npm run build

FROM nginx:1.25-alpine
COPY --from=builder /app/dist /usr/share/nginx/html
CMD ["nginx", "-g", "daemon off;"]
</code></pre>
<p><strong>Why this is better</strong></p>
<ul>
<li><p>Pin a stable base (<code>node:20-alpine</code>) → reproducible builds</p>
</li>
<li><p>Copy only dependency manifests before install → that layer stays cached as long as your dependencies don’t change</p>
</li>
<li><p>Copy app code later → changes here don’t invalidate the dependency install</p>
</li>
</ul>
<p>This simple reordering often cuts rebuild time by <strong>70–90%</strong>.</p>
<hr />
<h2 id="heading-2-latest-kills-reproducibility">2. “Latest” Kills Reproducibility</h2>
<p>If your BASE image changes under you, cache semantics become unreliable.</p>
<pre><code class="lang-plaintext">FROM node:latest
</code></pre>
<p>Today’s build is not the same as tomorrow’s build.</p>
<p>Use <strong>versioned base images</strong> instead:</p>
<pre><code class="lang-plaintext">FROM node:20-alpine
</code></pre>
<p>This fixes:</p>
<ul>
<li><p>reproducibility</p>
</li>
<li><p>downstream debugging</p>
</li>
<li><p>predictable cache</p>
</li>
</ul>
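<p>For fully reproducible builds you can go one step further and pin by digest. The digest below is a placeholder; substitute the one printed by <code>docker pull node:20-alpine</code> for your registry.</p>
<pre><code class="lang-plaintext">FROM node:20-alpine@sha256:...
</code></pre>
<p>A tag like <code>20-alpine</code> can still be re-pointed by the publisher; a digest cannot, so the same Dockerfile always resolves to the same base layers.</p>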
<hr />
<h2 id="heading-3-every-line-creates-a-layer">3. Every Line Creates a Layer</h2>
<p>Dockerfiles are <strong>immutable histories</strong>.<br />Every <code>RUN</code>, <code>COPY</code>, <code>ADD</code> becomes a layer.</p>
<p>If you install something and then delete it in a later line, the data still exists in the earlier layers, so your image stays just as big.</p>
<h3 id="heading-bad">Bad:</h3>
<pre><code class="lang-plaintext">RUN apt-get update
RUN apt-get install -y build-essential
RUN rm -rf /var/lib/apt/lists/*
</code></pre>
<h3 id="heading-better">Better:</h3>
<pre><code class="lang-plaintext">RUN apt-get update &amp;&amp; \
    apt-get install -y build-essential &amp;&amp; \
    rm -rf /var/lib/apt/lists/*
</code></pre>
<p>Now the package lists don’t survive in a separate layer.</p>
<hr />
<h2 id="heading-4-build-tools-dont-belong-in-runtime-images">4. Build Tools Don’t Belong in Runtime Images</h2>
<p>Build tools are only needed at build time. Shipping them to production is wasteful.</p>
<p>Use <strong>multi-stage builds</strong> intentionally:</p>
<pre><code class="lang-plaintext">FROM golang:1.22 as builder

WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download

COPY . .
RUN go build -o app

FROM gcr.io/distroless/base-debian12
COPY --from=builder /app/app /app
CMD ["/app"]
</code></pre>
<p>Final image includes only:</p>
<ul>
<li><p>the binary</p>
</li>
<li><p>runtime libs</p>
</li>
</ul>
<p>No Go, no compilers, no shells, no build cache.</p>
<p>This:</p>
<ul>
<li><p>cuts image size</p>
</li>
<li><p>reduces attack surface</p>
</li>
<li><p>eliminates unnecessary rebuild work</p>
</li>
</ul>
<hr />
<h2 id="heading-5-large-build-context-slow-upload">5. Large Build Context = Slow Upload</h2>
<p>Docker always uploads your build context to the builder.</p>
<p>If your context includes:</p>
<ul>
<li><p>node_modules</p>
</li>
<li><p>.git</p>
</li>
<li><p>logs</p>
</li>
<li><p>tests, docs, temp files</p>
</li>
</ul>
<p>Then every build has to send megabytes or more over the wire.</p>
<p>Fix this with a <code>.dockerignore</code>:</p>
<pre><code class="lang-plaintext">node_modules
.git
*.log
</code></pre>
<p>Smaller context = faster uploads = faster builds.</p>
<hr />
<h2 id="heading-6-ci-isnt-lying-it-exposes-bad-dockerfiles">6. CI Isn’t Lying: <em>It Exposes Bad Dockerfiles</em></h2>
<p>Locally, you might have warm cache.<br />CI runs on fresh machines.</p>
<p>That means:</p>
<ul>
<li><p>no existing cache</p>
</li>
<li><p>slow cold builds</p>
</li>
<li><p>every package download happens again</p>
</li>
</ul>
<p>If your Dockerfile depends on warm local cache to be fast, you built it wrong.</p>
<p>In CI, you must explicitly:</p>
<ul>
<li><p>export/import cache</p>
</li>
<li><p>use BuildKit with <code>--cache-from</code> / <code>--cache-to</code></p>
</li>
<li><p>or use dedicated layer caching</p>
</li>
</ul>
<p>Otherwise your CI builds always recreate steps that could be cached.</p>
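<p>With BuildKit, exporting the cache to a registry looks roughly like this; the registry and image names are placeholders for your own.</p>
<pre><code class="lang-plaintext">docker buildx build \
  --cache-from type=registry,ref=registry.example.com/myapp:buildcache \
  --cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
  -t registry.example.com/myapp:latest \
  --push .
</code></pre>
<p><code>mode=max</code> exports layers from every stage, not just the final one, which matters for multi-stage builds.</p>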
<hr />
<h2 id="heading-7-pin-dependencies-dont-let-them-float">7. Pin Dependencies, Don’t Let Them Float</h2>
<p>Floating dependencies (<code>latest</code>, <code>*</code>, unpinned versions) make builds unpredictable.</p>
<p>Lockfiles (<code>package-lock.json</code>, <code>go.sum</code>, <code>requirements.txt</code>) should only change when you change dependencies, not every code update.</p>
<p>This means:</p>
<ul>
<li><p>cache hits stay valid longer</p>
</li>
<li><p>CI resolves the same dependency graph on every run</p>
</li>
<li><p>debugging is possible</p>
</li>
</ul>
<hr />
<h2 id="heading-a-simple-mental-model">A Simple Mental Model</h2>
<p>Here’s the core truth you should adopt now:</p>
<blockquote>
<p><strong>Docker builds are predictable. Your Dockerfile determines whether they’re fast or slow.</strong></p>
</blockquote>
<p>Treat Dockerfiles as:</p>
<ul>
<li><p>deterministic build graphs</p>
</li>
<li><p>ordered instruction sequences</p>
</li>
<li><p>cache design problems, not scripts</p>
</li>
</ul>
<p>When you write a Dockerfile, ask:</p>
<ul>
<li><p>What changes frequently?</p>
</li>
<li><p>What changes rarely?</p>
</li>
<li><p>What steps can stay cached?</p>
</li>
</ul>
<p>Design around <strong>cache boundaries</strong>, not commands.</p>
<h2 id="heading-checklist-for-faster-docker-builds">Checklist for Faster Docker Builds</h2>
<p>Before shipping or committing a Dockerfile, ensure:</p>
<p>🟠 Base image is pinned<br />🟠 Dependency install layer is early<br />🟠 App code is copied after deps<br />🟠 Build context is minimal<br />🟠 Multi-stage builds separate build &amp; runtime<br />🟠 No unnecessary tools in final image<br />🟠 Lockfiles are present<br />🟠 CI cache is configured</p>
<p>If any of these are missing, your builds aren’t engineered, they’re accidental.</p>
]]></content:encoded></item><item><title><![CDATA[How Containers Actually Work (and Why Most Devs Get It Wrong)]]></title><description><![CDATA[Why this article exists
Most developers think they “know Docker” because they can run:
docker build, docker run, docker-compose up
That’s not understanding. That’s muscle memory. If this is where your Docker knowledge stops, you are operating at carg...]]></description><link>https://mathoholic.dev/how-containers-actually-work-and-why-most-devs-get-it-wrong</link><guid isPermaLink="true">https://mathoholic.dev/how-containers-actually-work-and-why-most-devs-get-it-wrong</guid><category><![CDATA[Docker]]></category><category><![CDATA[containers]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Developer]]></category><dc:creator><![CDATA[Shantanu Sharma]]></dc:creator><pubDate>Sun, 14 Dec 2025 13:30:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765718232929/de2598e7-f5d8-427c-be95-f51353d4c3b8.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-why-this-article-exists">Why this article exists</h2>
<p>Most developers think they “know Docker” because they can run:</p>
<p><code>docker build, docker run, docker-compose up</code></p>
<p>That’s not understanding. That’s muscle memory. If this is where your Docker knowledge stops, you are operating at cargo-cult level:</p>
<ul>
<li><p>You copy commands</p>
</li>
<li><p>You don’t understand consequences</p>
</li>
<li><p>You panic when things break</p>
</li>
</ul>
<p>Docker is <strong>not magic</strong>. Docker is Linux primitives glued together with tooling.</p>
<p>Until you understand what actually happens under the hood, you will:</p>
<ul>
<li><p>Debug blindly in production</p>
</li>
<li><p>Lose data due to bad volume configuration</p>
</li>
<li><p>Break networking and blame Docker</p>
</li>
<li><p>Ship bloated images</p>
</li>
</ul>
<p>This article strips Docker down to its bones.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765711535005/88d05cc1-d306-49f6-9b8a-3c9f1d607a67.jpeg" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-what-docker-really-is-no-marketing-nonsense">What Docker really is (no marketing nonsense)</h2>
<p>Let’s kill the myths first. Docker is <strong>not</strong> a virtual machine replacement, not a deployment platform, and not a magic packaging tool. What Docker actually is turns out to be much simpler and far more important: it uses <strong>Linux namespaces</strong> to isolate processes, <strong>cgroups</strong> to control resource usage, and <strong>union filesystems</strong> to build layered images, all coordinated by a long-running daemon called <strong>dockerd</strong>. Everything else you interact with — Dockerfiles, the CLI, Docker Compose — is just user experience built on top of these primitives. Docker didn’t invent containers; <strong>Linux did</strong>. Docker’s real contribution was making those low-level Linux features usable for everyday developers.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765712151365/2c0c4397-f963-4d2a-8b48-8017484274cc.jpeg" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-containers-vs-virtual-machines-the-lie-you-were-told">Containers vs Virtual Machines (the lie you were told)</h2>
<p>People often say: “Containers are lightweight VMs.”<br />That sentence has caused more production failures than bugs.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Aspect</td><td>Virtual Machine</td><td>Container</td></tr>
</thead>
<tbody>
<tr>
<td>Kernel</td><td>Separate</td><td>Shared with host</td></tr>
<tr>
<td>Boot time</td><td>Minutes</td><td>Milliseconds</td></tr>
<tr>
<td>Isolation</td><td>Hardware-level</td><td>Process-level</td></tr>
<tr>
<td>Overhead</td><td>Heavy</td><td>Lightweight</td></tr>
<tr>
<td>Security boundary</td><td>Stronger</td><td>Weaker (by design)</td></tr>
</tbody>
</table>
</div><h3 id="heading-the-uncomfortable-truth">The uncomfortable truth</h3>
<p>A container is just a process with constraints. <em>(meaning the kernel can limit and account for the resources allocated to it)</em></p>
<p>Same kernel.<br />Same OS.<br />Same host underneath.</p>
<p>If that sentence makes you uncomfortable, good.<br />It means you’re starting to understand Docker properly.</p>
<hr />
<h2 id="heading-the-filesystem-illusion-union-fs-explained-simply">The filesystem illusion (Union FS explained simply)</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765712862316/49fc1217-5f26-46ee-8fed-e624fbcbe059.jpeg" alt class="image--center mx-auto" /></p>
<p>When you pull a Docker image, Docker does not download one big file. It downloads layers.</p>
<p>Typical layers look like:</p>
<ol>
<li><p>Base OS layer</p>
</li>
<li><p>Runtime layer (Python, Node, Go)</p>
</li>
<li><p>Dependency layer</p>
</li>
<li><p>Application code layer</p>
</li>
</ol>
<p>Important facts about these layers:</p>
<ul>
<li><p>They are read-only</p>
</li>
<li><p>They are shared across containers</p>
</li>
<li><p>They are cached aggressively</p>
</li>
</ul>
<h3 id="heading-what-happens-when-a-container-starts">What happens when a container starts?</h3>
<p>Docker:</p>
<ul>
<li><p>Stacks all read-only layers</p>
</li>
<li><p>Adds one thin writable layer on top</p>
</li>
</ul>
<p>All file changes go only to that writable layer.</p>
<p>When the container is deleted:</p>
<ul>
<li><p>The writable layer disappears</p>
</li>
<li><p>Your data disappears with it</p>
</li>
</ul>
<p>That’s why:</p>
<ul>
<li><p>Writing data inside containers is a rookie mistake</p>
</li>
<li><p>Containers are disposable by design</p>
</li>
<li><p>Volumes exist</p>
</li>
</ul>
<hr />
<h2 id="heading-volumes-where-most-systems-go-to-die">Volumes: where most systems go to die</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765714799909/758bac4a-16cb-48f0-829c-51bb6fb0c746.jpeg" alt class="image--center mx-auto" /></p>
<p>Storage is where Docker setups usually collapse. You don’t get many choices—and choosing the wrong one guarantees pain.</p>
<p>There are <strong>three options</strong>.</p>
<ol>
<li><strong>Container filesystem (bad)</strong></li>
</ol>
<ul>
<li><p>Data lives inside the container’s writable layer</p>
</li>
<li><p>Destroy the container → data is gone</p>
</li>
<li><p>Only acceptable for temporary, throwaway files</p>
</li>
</ul>
<p>Use this for anything important and you’ve built a <strong>self-destructing system</strong>.</p>
<ol start="2">
<li><strong>Docker volumes (correct)</strong></li>
</ol>
<ul>
<li><p>Managed by Docker</p>
</li>
<li><p>Independent of the container lifecycle</p>
</li>
<li><p>Portable, predictable, and easy to back up</p>
</li>
</ul>
<p>This is what <strong>production systems are supposed to use</strong>.</p>
<ol start="3">
<li><strong>Bind mounts (dangerous)</strong></li>
</ol>
<ul>
<li><p>Directly map host filesystem paths into containers</p>
</li>
<li><p>Environment-specific and brittle</p>
</li>
<li><p>Easy to break, painful to debug</p>
</li>
</ul>
<p>Great for <strong>local development</strong>.<br />Risky in <strong>production</strong> unless you know exactly what you’re doing.</p>
<h3 id="heading-one-rule-you-must-remember">One rule you must remember</h3>
<blockquote>
<p>App code lives in images.<br />App data lives in volumes.</p>
</blockquote>
<p>Break this rule and production will punish you.</p>
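<p>In practice the rule looks like this; the container and volume names are illustrative.</p>
<pre><code class="lang-plaintext"># Named volume: created on first use, survives container removal
docker volume create pgdata
docker run -d --name pg -v pgdata:/var/lib/postgresql/data postgres:16

# Destroy and recreate the container - the data is still there
docker rm -f pg
docker run -d --name pg -v pgdata:/var/lib/postgresql/data postgres:16
</code></pre>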
<hr />
<h2 id="heading-networking-why-localhosthttplocalhost-betrays-you">Networking: Why <a target="_blank" href="http://localhost"><code>localhost</code></a> Betrays You</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765717605626/359e9b50-1935-4f8c-9bee-79e09a222db2.jpeg" alt class="image--center mx-auto" /></p>
<p>Inside a container:</p>
<ul>
<li><p><a target="_blank" href="http://localhost"><code>localhost</code></a> points <strong>only to the container itself</strong></p>
</li>
<li><p>Not the host</p>
</li>
<li><p>Not other containers</p>
</li>
<li><p>Not “where the database runs”</p>
</li>
</ul>
<p>This is where many systems break.</p>
<hr />
<h3 id="heading-what-docker-actually-sets-up">What Docker Actually Sets Up</h3>
<p>Docker doesn’t rely on magic. It creates:</p>
<ul>
<li><p>Virtual network bridges</p>
</li>
<li><p>Virtual network interfaces</p>
</li>
<li><p>An internal DNS server</p>
</li>
</ul>
<p>Every container on the same Docker network gets <strong>automatic service discovery</strong>.</p>
<p>That’s why this works:</p>
<pre><code class="lang-plaintext">db:5432
</code></pre>
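<p>A minimal Compose sketch of that service discovery; the service names and connection string here are illustrative:</p>
<pre><code class="lang-plaintext">services:
  api:
    build: .
    environment:
      # "db" resolves via Docker's internal DNS, not localhost
      DATABASE_URL: postgres://db:5432/app
  db:
    image: postgres:16
</code></pre>
<p>Both containers join the default network Compose creates, so <code>api</code> reaches the database as <code>db:5432</code> with no IP addresses anywhere.</p>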
<hr />
<h3 id="heading-the-hard-truth">The Hard Truth</h3>
<p>Docker resolves service names through its internal DNS. The moment you hardcode IP addresses or depend on <a target="_blank" href="http://localhost"><code>localhost</code></a> across containers, you’ve brought fragility into the system. It may appear to work today, in your environment, on your machine, but it will fail in production, usually under load or during a redeploy.</p>
<hr />
<h2 id="heading-the-docker-daemon-single-point-of-control">The Docker daemon: single point of control</h2>
<p>Everything in Docker flows through dockerd:</p>
<ul>
<li><p>Building images</p>
</li>
<li><p>Pulling images</p>
</li>
<li><p>Creating networks</p>
</li>
<li><p>Managing volumes</p>
</li>
<li><p>Running containers</p>
</li>
</ul>
<p>If dockerd crashes:</p>
<ul>
<li><p>Containers keep running</p>
</li>
<li><p>You lose control and orchestration</p>
</li>
</ul>
<p>This surprises people. It shouldn’t.</p>
<p>Containers are Linux processes.<br />Docker is just the manager.</p>
<p>This is not a bug.<br />This is how Linux works.</p>
<hr />
<h2 id="heading-what-you-should-take-away">What you should take away</h2>
<p>If you remember only five things:</p>
<ol>
<li><p>Containers are not VMs</p>
</li>
<li><p>Containers are processes</p>
</li>
<li><p>Filesystems are layered illusions</p>
</li>
<li><p>Data inside containers is disposable</p>
</li>
<li><p>Docker is Linux with a nice CLI</p>
</li>
</ol>
<p>Once this clicks:</p>
<ul>
<li><p>Docker stops being scary</p>
</li>
<li><p>Debugging becomes logical</p>
</li>
<li><p>Production failures make sense</p>
</li>
</ul>
<p>Ignore this, and Docker will keep “mysteriously” failing.</p>
<hr />
<h2 id="heading-whats-next">What’s next</h2>
<p>In the next article, we’ll dissect:</p>
<blockquote>
<p>Why 90% of Dockerfiles are inefficient — and how to fix them</p>
</blockquote>
<p>Most Dockerfiles in the wild are <strong>bloated, slow, insecure, and poorly cached</strong>, usually because they are written by copying patterns without understanding how Docker actually builds images. And yes, you are probably doing it wrong; fixing it will immediately make your builds faster, your images smaller, and your systems easier to run in production.</p>
]]></content:encoded></item><item><title><![CDATA[Make Your Python Code Faster with Dictionary Lookups]]></title><description><![CDATA[In the world of programming, efficiency is key. Whether you're a beginner or a seasoned developer, finding ways to optimize your code can make a significant difference, especially when working with large datasets. One powerful tool in Python that oft...]]></description><link>https://mathoholic.dev/pythondictionarylookups01</link><guid isPermaLink="true">https://mathoholic.dev/pythondictionarylookups01</guid><category><![CDATA[Python]]></category><category><![CDATA[dictionary]]></category><category><![CDATA[python beginner]]></category><category><![CDATA[Python 3]]></category><dc:creator><![CDATA[Shantanu Sharma]]></dc:creator><pubDate>Mon, 12 Aug 2024 03:48:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1723434128903/ba28b92a-2e1a-4b6b-acd5-831ef7b69b45.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the world of programming, efficiency is key. Whether you're a beginner or a seasoned developer, finding ways to optimize your code can make a significant difference, especially when working with large datasets. One powerful tool in Python that often goes underutilized is the dictionary lookup. In this post, we'll explore how dictionary lookups work, why they're so efficient, and how you can leverage them to enhance your Python programs.</p>
<h3 id="heading-what-is-a-dictionary-lookup">What is a Dictionary Lookup?</h3>
<p>In Python, a dictionary is a collection of key-value pairs, where each key is unique, and each key maps to a specific value. This data structure is incredibly versatile and allows for fast access to data. A dictionary lookup refers to the process of retrieving a value associated with a specific key in a dictionary.<br />Here is an example:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Creating a dictionary</span>
fruit_colors = {
    <span class="hljs-string">"apple"</span>: <span class="hljs-string">"red"</span>,
    <span class="hljs-string">"banana"</span>: <span class="hljs-string">"yellow"</span>,
    <span class="hljs-string">"cherry"</span>: <span class="hljs-string">"red"</span>,
    <span class="hljs-string">"orange"</span>: <span class="hljs-string">"orange"</span>
}

<span class="hljs-comment"># Performing a dictionary lookup</span>
color_of_apple = fruit_colors[<span class="hljs-string">"apple"</span>]
print(color_of_apple)  <span class="hljs-comment"># Output: red</span>
</code></pre>
<p>In this example, the key <code>"apple"</code> is used to quickly access the value <code>"red"</code>. This operation is extremely fast, and that speed is one of the primary reasons why dictionary lookups are so valuable.</p>
<h3 id="heading-why-are-dictionary-lookups-so-fast">Why Are Dictionary Lookups So Fast?</h3>
<p>The speed of dictionary lookups is primarily due to the underlying data structure: <strong>hash tables</strong>. When you create a dictionary in Python, each key is hashed using a hash function, which generates a unique identifier for that key. This hash is then used to determine where the corresponding value is stored in memory.</p>
<p>Because the hash function allows for direct access to the location of the value, retrieving a value from a dictionary typically takes <strong>O(1) time</strong>, meaning it’s constant time, regardless of the size of the dictionary. This is in stark contrast to other data structures, such as lists, where searching for a value can take <strong>O(n) time</strong> (linear time) because you might need to check each element.</p>
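<p>You can see the gap directly with <code>timeit</code>. This is a rough benchmark sketch, not a rigorous one:</p>
<pre><code class="lang-python">import timeit

n = 100_000
haystack_list = list(range(n))
haystack_dict = dict.fromkeys(range(n))

# Searching for the last key: the list scans every element,
# the dict jumps straight to it via the hash
list_time = timeit.timeit(lambda: (n - 1) in haystack_list, number=100)
dict_time = timeit.timeit(lambda: (n - 1) in haystack_dict, number=100)

print(dict_time, list_time)  # dict_time is dramatically smaller
</code></pre>
<p>Grow <code>n</code> and the list time grows with it, while the dict time stays flat.</p>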
<hr />
<p><strong>When Should You Use Dictionary Lookups?</strong></p>
<p>Dictionary lookups are particularly useful in situations where you need to frequently access data based on a unique identifier. Here are a few scenarios where dictionary lookups can greatly enhance your code:</p>
<ol>
<li><strong>Data Mapping</strong>: When you have a set of unique keys that map to specific values, such as user IDs mapping to user information.</li>
</ol>
<pre><code class="lang-python">user_data = {
    <span class="hljs-string">"user123"</span>: {<span class="hljs-string">"name"</span>: <span class="hljs-string">"Alice"</span>, <span class="hljs-string">"age"</span>: <span class="hljs-number">30</span>},
    <span class="hljs-string">"user456"</span>: {<span class="hljs-string">"name"</span>: <span class="hljs-string">"Bob"</span>, <span class="hljs-string">"age"</span>: <span class="hljs-number">25</span>},
    <span class="hljs-comment"># more users...</span>
}
</code></pre>
<ol start="2">
<li><p><strong>Caching Results</strong>: If you’re performing an expensive computation or accessing a slow resource (like a database or API), you can store the results in a dictionary and reuse them to avoid repeated operations.</p>
<pre><code class="lang-python"> expensive_computation_cache = {}
 <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">expensive_computation</span>(<span class="hljs-params">x</span>):</span>
     <span class="hljs-keyword">if</span> x <span class="hljs-keyword">in</span> expensive_computation_cache:
         <span class="hljs-keyword">return</span> expensive_computation_cache[x]
     <span class="hljs-comment"># Perform the computation</span>
     result = x * x
     expensive_computation_cache[x] = result
     <span class="hljs-keyword">return</span> result
</code></pre>
</li>
<li><p><strong>Fast Membership Testing</strong>: If you need to frequently check if an item exists in a collection, using a dictionary (or a set, which also uses hashing) allows for O(1) membership tests.</p>
<pre><code class="lang-python"> blacklisted_emails = {<span class="hljs-string">"spam@example.com"</span>, <span class="hljs-string">"junk@example.com"</span>}
 <span class="hljs-keyword">if</span> email <span class="hljs-keyword">in</span> blacklisted_emails:
     print(<span class="hljs-string">"This email is blacklisted."</span>)
</code></pre>
</li>
</ol>
<p><strong>Building Lookup Tables from Lists</strong></p>
<p>In some cases, you might start with a list of dictionaries or tuples and need to frequently search for items based on a specific attribute. Instead of iterating through the list every time, you can convert it into a dictionary (often referred to as a lookup table) to speed up your searches.</p>
<p>For example, let's say you have a list of products and you want to quickly find a product by its ID:</p>
<pre><code class="lang-python">products = [
    {<span class="hljs-string">"id"</span>: <span class="hljs-number">101</span>, <span class="hljs-string">"name"</span>: <span class="hljs-string">"Laptop"</span>, <span class="hljs-string">"price"</span>: <span class="hljs-number">799</span>},
    {<span class="hljs-string">"id"</span>: <span class="hljs-number">102</span>, <span class="hljs-string">"name"</span>: <span class="hljs-string">"Tablet"</span>, <span class="hljs-string">"price"</span>: <span class="hljs-number">499</span>},
    {<span class="hljs-string">"id"</span>: <span class="hljs-number">103</span>, <span class="hljs-string">"name"</span>: <span class="hljs-string">"Smartphone"</span>, <span class="hljs-string">"price"</span>: <span class="hljs-number">699</span>},
]

<span class="hljs-comment"># Build a lookup table for products by ID</span>
product_lookup = {product[<span class="hljs-string">"id"</span>]: product <span class="hljs-keyword">for</span> product <span class="hljs-keyword">in</span> products}

<span class="hljs-comment"># Now you can quickly find a product by its ID</span>
product = product_lookup.get(<span class="hljs-number">102</span>)
print(product)  <span class="hljs-comment"># Output: {'id': 102, 'name': 'Tablet', 'price': 499}</span>
</code></pre>
<p>By transforming your list into a dictionary, you transform your search operation from <strong>O(n)</strong> to <strong>O(1)</strong>, greatly improving efficiency.</p>
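<p>The lookup above uses <code>.get()</code> rather than square brackets, and the difference matters: indexing a missing key raises <code>KeyError</code>, while <code>.get()</code> returns <code>None</code> or a default you choose.</p>
<pre><code class="lang-python">products_by_id = {101: "Laptop", 102: "Tablet"}

print(products_by_id.get(102))             # Tablet
print(products_by_id.get(999))             # None, no exception raised
print(products_by_id.get(999, "unknown"))  # unknown
</code></pre>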
<hr />
<h4 id="heading-pitfalls-to-watch-out-for"><strong>Pitfalls to Watch Out For</strong></h4>
<p>While dictionary lookups are powerful, there are a few things to be aware of:</p>
<ul>
<li><p><strong>Memory Usage</strong>: Dictionaries are fast, but they can use more memory than lists due to the overhead of storing hash tables. If memory is a constraint, consider this trade-off.</p>
</li>
<li><p><strong>Mutable Keys</strong>: In Python, dictionary keys must be immutable (e.g., strings, numbers, tuples). If you try to use a mutable type (like a list) as a key, you’ll encounter an error.</p>
</li>
<li><p><strong>Collisions</strong>: Although rare, hash collisions can occur, where two different keys produce the same hash value. Python handles this internally, but it’s something to be aware of if you’re working with a large set of keys.</p>
</li>
</ul>
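<p>The mutable-key rule is easy to see in action:</p>
<pre><code class="lang-python">d = {}
try:
    d[[1, 2]] = "value"   # a list is mutable, hence unhashable
except TypeError as err:
    message = str(err)

print(message)  # unhashable type: 'list'

d[(1, 2)] = "value"       # a tuple works: it is immutable
print(d[(1, 2)])          # value
</code></pre>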
<hr />
<h4 id="heading-conclusion"><strong>Conclusion</strong></h4>
<p>Dictionary lookups are a fundamental tool in Python programming that offers significant performance benefits. By understanding how they work and when to use them, you can write more efficient, scalable, and maintainable code. Whether you're dealing with large datasets, building a cache, or simply trying to speed up your searches, dictionary lookups should be one of your go-to techniques.</p>
<p>So, next time you're faced with a problem that involves frequent data retrieval, consider reaching for a dictionary—you'll be amazed at how much faster your code can run!</p>
<p>#Python #Programming #DataScience #CodingEfficiency #TechTips #Optimization</p>
]]></content:encoded></item><item><title><![CDATA[Beginner's Guide to Machine Learning]]></title><description><![CDATA[This series of articles will make machine learning easier to understand with resources to learn in depth.
What is machine learning?
There are complex definitions available which defines machine learning in more technical and mathematical terms. To st...]]></description><link>https://mathoholic.dev/machine-learning-01</link><guid isPermaLink="true">https://mathoholic.dev/machine-learning-01</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Shantanu Sharma]]></dc:creator><pubDate>Sat, 22 Jun 2024 04:16:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1719029137736/80a95fcf-7eda-4b86-bbed-b10e6ff8cc9a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This series of articles will make machine learning easier to understand with resources to learn in depth.</p>
<h3 id="heading-what-is-machine-learning">What is machine learning?</h3>
<p>There are complex definitions that describe machine learning in more technical and mathematical terms. To start with a simple one, we will consider this:</p>
<blockquote>
<p>Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.</p>
<p>~ <a target="_blank" href="https://www.forbes.com/sites/gilpress/2021/05/28/on-thinking-machines-machine-learning-and-how-ai-took-over-statistics/">Arthur Samuel</a>, Computer Scientist</p>
</blockquote>
<p>Samuel's definition of machine learning distinguishes it from traditional programming. If you are a programmer, you know that in traditional programming the rules for handling input data are defined by hand, and the program generates output based on those rules. Machine learning flips this whole paradigm.</p>
<p>In machine learning, data points and their corresponding correct answers (labels) are provided to the computer. The computer uses this information to learn patterns and rules, so we don't write the rules ourselves. These learned rules allow the computer to predict the correct answers for new, unseen data. Machine learning is a data-driven process.</p>
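<p>As a toy illustration of this flip (my own example, in plain Python): below, the rule <em>y = 2x</em> is never written into the program — the computer recovers it from labelled data points and then applies it to unseen input.</p>

```python
# Traditional programming: the rule is hard-coded by the programmer.
def rule(x):
    return 2 * x

# Machine learning: only (input, correct answer) pairs are given.
data = [(1, 2), (2, 4), (3, 6), (4, 8)]  # labelled examples of y = 2x

# Least-squares estimate of the slope w in y = w * x (no intercept).
w = sum(x * y for x, y in data) / sum(x * x for x, y in data)

print(w)       # 2.0 — the learned rule matches the hidden pattern
print(w * 10)  # 20.0 — prediction for the unseen input x = 10
```

The learned parameter <code>w</code> plays the role of the rule; with more data and more parameters, the same idea scales up to real machine learning models.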
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1719029548992/14370fb3-5314-4247-95d2-b4d8635b0981.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-how-machine-learning-works">How does machine learning work?</h3>
<p>As machine learning is a data-driven process, good-quality data is essential for building an efficient machine learning solution. The process of machine learning involves seven broad steps.</p>
<ol>
<li><p><strong>Data Collection:</strong></p>
<p> Gathering data is the foundation of the machine learning process. The quality and quantity of the data we gather directly determine how well the model will perform.</p>
</li>
<li><p><strong>Preparing the Data:</strong></p>
<p> This step involves data wrangling: removing duplicates, correcting errors, handling missing values, converting data types, and so on. We use data visualization techniques to see relevant relationships between attributes, check for outliers, and select the right features for the model. The data is randomized so that its order does not affect what is learned. Another important step is to divide the data into two parts: the larger part (~80%) is used for training the model, and the smaller part is used to evaluate the trained model's performance.</p>
</li>
<li><p><strong>Choose a model:</strong></p>
<p> Data scientists around the world have developed models for different purposes and goals. From the existing models, we select the one that best suits our purpose and goal.</p>
</li>
<li><p><strong>Training:</strong></p>
<p> Training involves running the model on the prepared dataset, allowing it to learn from the data and make predictions. As the model processes the data, it tries to find hidden patterns.</p>
<p> The training process starts with random values for the model's parameters, say X and Y. The model uses these values to make predictions, which are then compared to the actual correct answers. Based on this comparison, the model adjusts X and Y to improve its predictions. Each cycle of updating is called a training step.</p>
<p> The training process is iterative, meaning the model will continually learn and improve as it processes more data.</p>
</li>
<li><p><strong>Evaluation:</strong></p>
<p> The portion of the dataset set aside for evaluation is used to check the model's proficiency. This puts the model in scenarios it did not encounter during training.</p>
</li>
<li><p><strong>Fine-Tuning:</strong></p>
<p> After evaluating the model's performance, we can often improve its accuracy further by fine-tuning certain parameters. These parameters, which took default values during the initial training, can be adjusted to enhance the model's predictions. This process of adjusting parameters to optimize the model's performance is known as parameter tuning or hyperparameter tuning.</p>
<p> These hyperparameters might include how fast the model learns (the learning rate), the number of layers in a neural network, or the number of groups a clustering algorithm forms.</p>
<p> To fine-tune, we try different combinations of these settings and see how well the model performs, with the goal of finding the best combination. There are several ways to tune hyperparameters:</p>
<ul>
<li><p><strong>Grid Search:</strong> Test all possible combinations.</p>
</li>
<li><p><strong>Random Search:</strong> Test a random selection of combinations.</p>
</li>
<li><p><strong>Bayesian Optimization:</strong> Use a smart method to predict which combinations will work best.</p>
</li>
</ul>
</li>
<li><p><strong>Deployment &amp; Monitoring:</strong></p>
<p> Once the model is trained and its hyperparameters are optimized, it is integrated into a production environment where it can start making predictions on new, unseen data. Often, models are deployed as part of a web service, accessible via APIs (Application Programming Interfaces) so other applications can send data to the model and receive predictions.</p>
</li>
</ol>
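<p>The steps above can be sketched end to end in plain Python. This is a toy example of my own (a one-parameter model learning the hidden rule <em>y = 3x</em>), not code from the book: it splits the data ~80/20, trains by iterative gradient steps, evaluates on the held-out part, and grid-searches the learning rate.</p>

```python
import random

# Hidden relationship the model must discover: y = 3x (noise-free for simplicity).
random.seed(0)
data = [(float(x), 3.0 * x) for x in range(1, 11)]

# Step 2 - Preparing the data: randomize the order, then split ~80/20.
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Step 4 - Training: start from a random parameter w and improve it step by step.
def train_model(samples, lr, epochs=200):
    w = random.random()                  # random initial value
    for _ in range(epochs):
        for x, y in samples:
            pred = w * x                 # model's prediction
            w -= lr * (pred - y) * x     # gradient of the squared error
    return w

# Step 5 - Evaluation: mean squared error on data the model never saw.
def mse(w, samples):
    return sum((w * x - y) ** 2 for x, y in samples) / len(samples)

# Step 6 - Fine-tuning: a tiny grid search over the learning rate.
best_w, best_err = None, float("inf")
for lr in (0.001, 0.005, 0.01):
    w = train_model(train, lr)
    err = mse(w, test)
    if err < best_err:
        best_w, best_err = w, err

print(round(best_w, 3), best_err < 1e-6)  # learned slope ~3.0, tiny held-out error
```

Real projects swap this hand-rolled loop for a library model and tools like grid or random search, but the shape of the workflow is the same.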
<p>    We continuously monitor the model's performance to ensure it maintains accuracy and efficiency. This involves tracking metrics such as prediction accuracy, response time, and resource usage. The model is re-trained periodically so that it remains up-to-date.</p>
<p>    The final step ensures that the model remains useful, reliable, and relevant in its production environment.</p>
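<p>To make the deployment step concrete, here is a minimal sketch (my own, using only Python's standard-library <code>http.server</code> and <code>urllib</code>; real deployments would use a proper web framework) of a trained model exposed as a prediction API: a client POSTs input data as JSON and receives a prediction back.</p>

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

W = 3.0  # parameter of the "trained" model (hypothetical value)

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": W * payload["x"]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), PredictHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# A client application sends new data to the model and receives a prediction.
req = urllib.request.Request(
    "http://127.0.0.1:%d" % server.server_port,
    data=json.dumps({"x": 7}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))  # {'prediction': 21.0}

server.shutdown()
```

Monitoring would then track this endpoint's accuracy, latency, and resource usage over time, triggering re-training when performance drifts.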
<p>Biblography:<br />Doshi, R., &amp; Hiran, K. K. (2021). <a target="_blank" href="https://amzn.to/3RDvVag"><em>Machine Learning</em></a>. Paperback.</p>
]]></content:encoded></item></channel></rss>