Panel
- Bart Busschots (host) – @bbusschots – Flickr
In this solo show Bart describes the technological tricks modern smartphone cameras use to take better photos than traditional optics and sensors alone could possibly achieve.
While this podcast is free for you to enjoy, it’s not free for Bart to create. Please consider supporting the show by becoming a patron on Patreon.
Reminder – you can submit questions for future Q & A shows at http://lets-talk.ie/photoq
Notes
Friend of the show Allison Sheridan recently invited me onto the NosillaCast to answer her specific questions about how the iPhone 17 Pro's 3 'holes' translate into so many 'optical quality' zoom modes. Allison is an engineer, so she came armed with lots of detailed spec tables and very specific questions. It was a fun discussion, but it was very iPhone-specific and focused on Allison's particular questions. That conversation inspired me to revisit the topic here, but to broaden it out.
You can argue about whether Apple is the best at making phone cameras, but what's undeniable is that Apple is one of the best. None of the techniques Apple uses are unique to Apple, though Apple's use of these techniques is remarkably refined and well executed. Apple seamlessly blend many cutting-edge technologies into a single unified camera experience that hides all that complexity from users, gets out of our way, and lets us concentrate on making great photographs.
Since all the high-end smartphone makers use the same technological toolbox, I thought it might be interesting to take a look at what's in that toolbox in 2025.
The problem to be solved
The laws of physics greatly limit what traditional camera designs can achieve on smartphones — the sensors and lenses simply must be small, and what's more, the entire package must be very thin too. The laws of optics tell you the depth of field will be deeper than on traditional digital cameras, the sensors noisier, and the zoom levels lower.
Apple and others have certainly done their best: making the best possible lenses with amazing materials like sapphire, making the sensors as good as they can possibly be despite their size, and making the optical train surprisingly long despite being thin by using clever folded light paths like Apple's tetraprism design. But that's nowhere near enough to produce the amazing photos we all take on our modern smartphones. That's where the techniques we'll be focusing on today come into play.
Machine Learning is Central
Long before Large Language Models and generative AI became common household concepts, there was Machine Learning, or ML. This is where you teach artificial neural networks to process data in some kind of optimised way. You give the network billions of sample inputs, grade the outputs it produces on some metric of your choice, tweak the network, and repeat, not just once or twice, but millions or billions of times, until those neural networks are experts at performing your desired data processing.
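To make that give-grade-tweak-repeat loop a little more concrete, here's a deliberately tiny Python sketch. It trains a single linear layer to clean up simulated noisy data; it's nothing like the scale or sophistication of a real image pipeline, and none of the numbers below come from Apple, they're purely illustrative.

```python
# A toy illustration of the ML training loop described above: show the network
# inputs, score its outputs with a metric (the "loss"), nudge the weights, repeat.
# Purely illustrative; real image-pipeline models are vastly larger and trained
# on billions of samples.
import numpy as np

rng = np.random.default_rng(42)

# Toy task: learn to map noisy inputs back towards their clean values.
clean = rng.uniform(0, 1, size=(1000, 8))             # "ground truth" samples
noisy = clean + rng.normal(0, 0.1, size=clean.shape)  # simulated sensor noise

weights = rng.normal(0, 0.1, size=(8, 8))              # a single linear "layer"

learning_rate = 0.01
for step in range(5000):                    # repeat many, many times
    prediction = noisy @ weights            # process the inputs
    error = prediction - clean              # grade the outputs (mean-squared-error metric)
    gradient = noisy.T @ error / len(noisy) # work out which way to tweak
    weights -= learning_rate * gradient     # tweak the network
```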
Initially the ML was just handling the signal processing needed to translate raw and noisy sensor data into images, but now there is ML optimising countless decisions throughout the entire process, and much of that machine learning is implemented in hardware to really speed it up. This is the so-called image pipeline Apple like to brag about making better with each iPhone release.
Initially, that image pipeline simply converted a single exposure on one sensor through one lens into a single photo, but now, the inputs to that pipeline can be much more complex.
The Toolbox of Techniques
Let's look at the range of techniques modern image pipelines utilise to assemble our smartphone photos.
Single-Sensor Frame Stacking
If you want less noise in your image, you need more signal! The first trick the image pipelines learned was to fire the sensor multiple times for each requested photo and combine the data from those multiple exposures to boost the signal-to-noise ratio. This requires more than just averaging each pixel across the frames; you also need to auto-align the frames to deal with movement, which is of course more work for the machine learning!
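As a rough illustration of the idea, here's a minimal Python sketch that aligns a burst of frames with a brute-force translation search and then averages them. Real pipelines handle rotation, rolling shutter, and moving subjects with far more sophisticated, ML-driven alignment; this is just to show how averaging aligned frames boosts the signal-to-noise ratio.

```python
# A minimal sketch of single-sensor frame stacking, assuming the hand shake
# between frames is a pure translation. Align, then average: the random noise
# partially cancels while the signal reinforces.
import numpy as np

def align_by_shift(reference: np.ndarray, frame: np.ndarray, max_shift: int = 8) -> np.ndarray:
    """Find the integer (dy, dx) shift that best matches frame to reference."""
    best_score, best_shift = None, (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(frame, (dy, dx), axis=(0, 1))  # crude: wraps at edges
            score = np.mean((shifted - reference) ** 2)
            if best_score is None or score < best_score:
                best_score, best_shift = score, (dy, dx)
    return np.roll(frame, best_shift, axis=(0, 1))

def stack_frames(frames: list[np.ndarray]) -> np.ndarray:
    """Align every frame to the first one, then average to boost SNR."""
    reference = frames[0]
    aligned = [reference] + [align_by_shift(reference, f) for f in frames[1:]]
    return np.mean(aligned, axis=0)
```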
If you shoot each frame at a different exposure time, you can use this technique to boost the dynamic range as well as the signal-to-noise ratio. You shoot one frame at the 'correct' exposure, one a little shorter to get more detail in the highlights, and one a little longer to get more detail in the shadows, or perhaps even more frames if the scene has an extremely wide dynamic range, or if the scene is particularly dark.
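Here's a hedged sketch of the merging side of that trick: each pixel is weighted by how well exposed it is in each bracketed frame, so the highlights come mostly from the short exposure and the shadows mostly from the long one. The weighting function and the crude tone-mapping are my own simplifications, not anything Apple has published.

```python
# A simplified exposure-bracket merge. Assumes the frames are already aligned
# and hold values in the 0..1 range; exposure_times are in seconds.
import numpy as np

def merge_bracket(frames: list[np.ndarray], exposure_times: list[float]) -> np.ndarray:
    weighted_sum = np.zeros_like(frames[0], dtype=np.float64)
    weight_sum = np.zeros_like(frames[0], dtype=np.float64)
    for frame, t in zip(frames, exposure_times):
        # Pixels near 0.5 are well exposed; near 0 or 1 they carry little information.
        weight = 1.0 - np.abs(frame - 0.5) * 2.0 + 1e-6
        weighted_sum += weight * (frame / t)   # normalise each frame to a common exposure
        weight_sum += weight
    radiance = weighted_sum / weight_sum
    return radiance / radiance.max()           # crude tone-map back into 0..1
```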
To optimise this approach, sensors can be flipped into video mode and the resulting stream of frames combined, either to make the Harry-Potter-like moving photo effect, or, when pushed to the extreme, to achieve the so-called night modes many modern smartphones offer.
As well as stacking multiple frames at multiple exposures, or a constant stream of frames as if shooting a movie, it's also possible to alter the focus distance between frames to boost the depth of field when shooting things that are very close to the lens. This is one of the two techniques that power the macro modes on smartphones.
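For the focus-bracketing half of the story, a minimal focus-stacking sketch looks something like this: measure local sharpness in every frame and, for each pixel, keep the value from whichever frame is sharpest there. It assumes the frames are already aligned, and real macro modes are of course far cleverer about it.

```python
# A minimal focus-stacking sketch over aligned, single-channel frames.
import numpy as np

def local_contrast(image: np.ndarray) -> np.ndarray:
    """A crude sharpness measure: absolute difference from the 3x3 local mean."""
    padded = np.pad(image, 1, mode="edge")
    neighbourhood_mean = sum(
        padded[1 + dy : padded.shape[0] - 1 + dy, 1 + dx : padded.shape[1] - 1 + dx]
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
    ) / 9.0
    return np.abs(image - neighbourhood_mean)

def focus_stack(frames: list[np.ndarray]) -> np.ndarray:
    sharpness = np.stack([local_contrast(f) for f in frames])  # (n, H, W)
    best = np.argmax(sharpness, axis=0)                        # sharpest frame per pixel
    stacked = np.stack(frames)
    return np.take_along_axis(stacked, best[None], axis=0)[0]
```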
Multi-Sensor Frame Stacking
The second trick macro modes rely on is the logical evolution from combining frames shot through a single lens onto a single sensor, to combining frames from multiple sensors, shot through multiple lenses, into a single final image.
When iPhones switch to macro mode the ML is combining multiple frames from two of the phone's physical cameras together to produce the image. This is why the iPhone Air doesn't have a macro mode — it only has one back camera!
As well as powering the macro mode, combining simultaneous exposures from multiple lenses also enables high-quality shots between the zoom levels that match the physical cameras on the backs of our phones. In early phones these intermediate zooms were always low quality because they relied on interpolating up from the lens below, or cropping and zooming with the lens above, but now they are true blends of data from the lenses either side of the chosen zoom level.
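To make the intermediate-zoom idea concrete, here's a rough sketch of blending a 1x and a 2x frame into, say, a 1.5x photo: the wide frame is cropped and upscaled to supply the whole field of view, and the tele frame, which only covers the centre of that framing, is scaled down and blended into the middle with more weight the closer the requested zoom gets to 2x. The nearest-neighbour resizing and the linear weighting are my stand-ins for much smarter, ML-driven blending.

```python
# A rough sketch of synthesising a zoom between two physical cameras (1x and 2x here).
# Assumes single-channel float frames of the same size with values in 0..1.
import numpy as np

def resize_nearest(image: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    in_h, in_w = image.shape[:2]
    rows = (np.arange(out_h) * in_h / out_h).astype(int)
    cols = (np.arange(out_w) * in_w / out_w).astype(int)
    return image[rows][:, cols]

def intermediate_zoom(frame_1x: np.ndarray, frame_2x: np.ndarray, zoom: float) -> np.ndarray:
    """Blend a 1x and a 2x frame for a requested zoom between 1.0 and 2.0."""
    h, w = frame_1x.shape[:2]
    # Wide camera: centre-crop to the requested framing, then upscale to full size.
    crop_h, crop_w = int(h / zoom), int(w / zoom)
    top, left = (h - crop_h) // 2, (w - crop_w) // 2
    result = resize_nearest(frame_1x[top : top + crop_h, left : left + crop_w], h, w)
    # Tele camera: its narrower field covers only the centre of this framing.
    region_h, region_w = int(h * zoom / 2.0), int(w * zoom / 2.0)
    r_top, r_left = (h - region_h) // 2, (w - region_w) // 2
    tele = resize_nearest(frame_2x, region_h, region_w)
    weight = zoom - 1.0                      # favour the tele data more as zoom approaches 2x
    centre = result[r_top : r_top + region_h, r_left : r_left + region_w]
    result[r_top : r_top + region_h, r_left : r_left + region_w] = (1 - weight) * centre + weight * tele
    return result
```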
Pixel Binning
Early smartphone sensors were pushing the limits so hard that the thought of having more pixels on the sensor than would appear in the final image seemed insane — there were so few pixels to start with, why would you throw data away?
Now, sensors have matured to the point that it makes sense to have more physical pixels than are used in the final photos some of the time.
For example, you can ask the iPhone to give you the highest possible resolution in your photos, in which case the 48MP sensors are used to create 48MP photos at the zoom levels matching the physical cameras, but that's not the default. The default is to give you 24MP or 12MP photos from those 48MP sensors. What's happening is that data from multiple physical pixels is being combined into each output photo pixel to reduce the noise and increase the quality.
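The arithmetic behind the 12MP case is simple 2x2 binning: every four physical pixels are averaged into one output pixel, which is how a 48MP sensor yields a 12MP photo with noticeably less noise per pixel. A minimal sketch:

```python
# 2x2 pixel binning: average each block of four physical pixels into one output pixel,
# quartering the pixel count while reducing the noise in each output pixel.
import numpy as np

def bin_2x2(sensor: np.ndarray) -> np.ndarray:
    h, w = sensor.shape[:2]
    blocks = sensor[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3))
```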
The other thing these big sensors allow is good images that use only a subset of the sensor. If you take the iPhone's 1x camera and use all the pixels, binned or not, you get that 1x zoom, but if you use just the centre part of that sensor you get a 2x zoom where you still have one physical pixel for each pixel in the final photo, or, to use Apple's lingo, an optical-quality zoom. The same trick gets you from the physical 4x lens on the iPhone 17 Pro to an 8x optical-quality zoom.
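The crop-based optical-quality zoom is just as simple in principle: take the central half of the sensor in each dimension and you have a 12MP image at twice the effective focal length, still with one physical pixel per output pixel. A sketch:

```python
# "Optical-quality" 2x zoom from a 48MP sensor: no binning, just the central
# quarter of the sensor area, leaving one physical pixel per output pixel.
import numpy as np

def optical_quality_2x(sensor_48mp: np.ndarray) -> np.ndarray:
    h, w = sensor_48mp.shape[:2]
    top, left = h // 4, w // 4
    return sensor_48mp[top : top + h // 2, left : left + w // 2]
```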
The new iPhones also take this same concept in a slightly different direction to offer vertical and horizontal aspect ratios at the same resolution without needing to rotate your phone. This has been achieved by making the sensor in the front-facing camera square, so both full-size horizontal and vertical strips of pixels are available to the image pipeline. The pixels in the four corners will never be used in these two modes, but they are used when the same sensor is used for video recording with the motion compensation and/or smart zoom feature that automatically pans around to keep you in shot as you move. What's really happening is that a wandering subset of the sensor is being used, so there is still one physical pixel per video pixel, but which part of the sensor those pixels come from drifts around as you move the phone and/or your head!
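A toy illustration of the square-sensor trick, assuming an idealised square pixel grid: the landscape and portrait framings are just different full-width or full-height strips of the same grid, and the follow-you-around framing is simply a sub-window whose position drifts as you move.

```python
# Different crops of one hypothetical square sensor. All names and the 4:3 aspect
# ratio are illustrative assumptions, not published specifications.
import numpy as np

def landscape_crop(square: np.ndarray, aspect: float = 4 / 3) -> np.ndarray:
    side = square.shape[0]
    strip_h = int(side / aspect)
    top = (side - strip_h) // 2
    return square[top : top + strip_h, :]           # full width, reduced height

def portrait_crop(square: np.ndarray, aspect: float = 4 / 3) -> np.ndarray:
    side = square.shape[0]
    strip_w = int(side / aspect)
    left = (side - strip_w) // 2
    return square[:, left : left + strip_w]         # full height, reduced width

def follow_crop(square: np.ndarray, centre_row: int, centre_col: int,
                out_h: int, out_w: int) -> np.ndarray:
    """A sub-window that wanders around the sensor to keep the subject framed."""
    side = square.shape[0]
    top = int(np.clip(centre_row - out_h // 2, 0, side - out_h))
    left = int(np.clip(centre_col - out_w // 2, 0, side - out_w))
    return square[top : top + out_h, left : left + out_w]
```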
Lidar
To get around the deep depth of field inherent to small sensors, lidar can be used to add a depth map to the image. The lidar resolution is lower than the sensor resolution, so it's not that the image pipeline knows the exact distance to each pixel, but it does get the average distance for groups of pixels, and that can be used to very accurately simulate a shallower depth of field.
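Here's a hedged sketch of how a coarse depth map can be turned into a fake shallow depth of field: the small depth grid is upscaled to match the image, each pixel gets a blur strength based on how far its depth is from the chosen focal distance, and the sharp image is blended with a blurred copy accordingly. Real portrait modes are vastly more careful, especially around tricky edges like hair, and all the names and numbers here are illustrative assumptions.

```python
# Simulated shallow depth of field from a coarse depth map.
# Assumes a single-channel float image in 0..1 and depths in metres.
import numpy as np

def box_blur(image: np.ndarray, radius: int) -> np.ndarray:
    if radius == 0:
        return image
    padded = np.pad(image, radius, mode="edge")
    out = np.zeros_like(image, dtype=np.float64)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            out += padded[radius + dy : radius + dy + image.shape[0],
                          radius + dx : radius + dx + image.shape[1]]
    return out / ((2 * radius + 1) ** 2)

def portrait_mode(image: np.ndarray, coarse_depth: np.ndarray, focus_m: float,
                  max_radius: int = 6) -> np.ndarray:
    # Upscale the coarse depth grid (e.g. a small lidar map) to image resolution.
    h, w = image.shape[:2]
    rows = (np.arange(h) * coarse_depth.shape[0] / h).astype(int)
    cols = (np.arange(w) * coarse_depth.shape[1] / w).astype(int)
    depth = coarse_depth[rows][:, cols]
    # 0 at the focal plane, approaching 1 far from it.
    blur_amount = np.clip(np.abs(depth - focus_m) / (focus_m + 1e-6), 0.0, 1.0)
    # Blend the sharp image with a blurred copy according to that amount.
    blurred = box_blur(image, max_radius)
    return (1.0 - blur_amount) * image + blur_amount * blurred
```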
This trick evolved from using just a handful of lidar points for fast autofocus, to using a coarse grid of depth pixels on just the front-facing camera for face recognition, to using ever more precise lidar sensors on both sides of the phone to give ever better so-called portrait modes.
Generative AI
Finally, I should mention a tool Apple has chosen not to use in its cameras, but others are choosing to use — generative AI. Apple have drawn a line in the sand between shooting and editing, offering generative features when editing, but only using genuine data from one or more sensors when shooting. Some smartphone makers have chosen to work around physical limitations by adding generative AI into their image pipelines to literally confabulate (or hallucinate) data that was never captured to make the photos look better.
Personally, I think it's dishonest and deceptive to just invent data when shooting photos or videos, so I'm all in favour of Apple's approach. There are times I choose to use generative AI when editing artistic shots, but I never want that to happen behind my back!
Final Thoughts
Every time you tap that shutter button on a modern smartphone and a preview image appears in what feels like an instant, don't assume the process was simple! There is an amazing amount of cutting-edge technology sitting between the front of that lens and the image you see on your screen just a few millimetres behind it! Every now and then, take a moment to appreciate just how amazing it is that we can capture such wonderful photos on a device small enough to fit in our pockets that we have with us all the time!