Project name: Modelling petroleum coke refinery gross output (2026)

Author : Seydou DIA

Last update: 06-03-2026

Contact: Linkedin

For more projects on Data Science and energy click here > Data Science Portfolio

Information

Estimating refinery yields is a complex endeavour that matters to market players, oil statisticians, analysts and modellers. When yields are not published, they are difficult to estimate because they depend heavily on:

  • Type of crude used
  • Refinery secondary units and their configurations
  • Product cracking prices
  • Refineries' market strategies

In this project we explore whether yearly petroleum coke output at country level can be modelled very simply, based solely on yearly crude throughput, atmospheric distillation capacity and coking unit capacity. The data is based on official refinery gross output available in the International Energy Agency's Oil Information database and on refinery data collected from a combination of official and secondary sources. For confidentiality purposes, the data has been anonymised and a random number is assigned to each country. The code to process the data and generate outputs is available on the repository.




Petroleum coke and coking units

Petroleum coke is a black solid by-product, obtained mainly by cracking and carbonising petroleum-derived feedstock, vacuum bottoms, tar and pitches in processes such as delayed coking or fluid coking. It consists mainly of carbon (90 to 95%) and has a low ash content. It is used as a feedstock in coke ovens for the steel industry, for heating purposes, for electrode manufacture and for production of chemicals. The two most important qualities are "green coke" and "calcinated coke". This category also includes "catalyst coke" deposited on the catalyst during refining processes; this coke is not recoverable and is usually burned as refinery fuel.

Thermal cracking is a refining process in which heat and pressure are used to break down, rearrange, or combine hydrocarbon molecules. Coking is a thermal cracking process that produces light and intermediate distillates by cracking molecules of higher molecular weight. Fuel gas and petroleum coke are obtained as by-products of this process.

We will first explore our data then model it very simply using a linear regression.

In the chart above we plot petroleum coke output in absolute terms against total thermal cracking (of which coking) capacity at country level between 2006 and 2023. The size of each circle represents the aggregated atmospheric distillation capacity at country level, and the color represents petroleum coke output as a percentage of crude throughput.

As expected, there seems to be a linear relationship between the size of the thermal coking units aggregated at country level and the amount of petroleum coke produced. Looking at the size of the circles, it also seems that the smaller the atmospheric distillation capacity, the lower the output in absolute terms. Nonetheless, there is no clear relation between the yield of petroleum coke and the size of the coking units. In the observed dataset, the average share of input is 3%, with the highest share being 20% for the country that has the smallest refinery (1309 KT/Y).

Although we seem to have a linear relationship, the chart above can be hard to analyze given the differences in refinery sizes across countries. To better validate our assumption and make the visualisation easier to read, we introduce a new indicator.

$$ \frac{\text{ThermalCokingCapacity}_{i,t}}{\text{AtmosphericDistillationCapacity}_{i,t}} $$

$ \text{AtmosphericDistillationCapacity}_{i,t} $: atmospheric distillation capacity of country $ i $ in year $ t $.

$ \text{ThermalCokingCapacity}_{i,t} $: thermal coking capacity of country $ i $ in year $ t $.

Normalizing thermal coking capacity by the atmospheric distillation capacity will help us to:

  • Account for the importance of this unit at the country level relative to total distillation capacity
  • Simplify the comparison between countries
  • Avoid scale bias when we start modelling output versus capacity
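The indicator defined above can be sketched in a few lines of Python. This is only an illustration: the `CountryYear` record, the `coking_ratio` helper and all figures are hypothetical and are not taken from the project's code or data. It assumes both capacities are expressed in the same unit.

```python
from dataclasses import dataclass


@dataclass
class CountryYear:
    """One observation: an anonymised country in a given year (hypothetical layout)."""
    country: str                  # anonymised country code
    year: int
    coking_capacity: float        # thermal coking capacity
    distillation_capacity: float  # atmospheric distillation capacity, same unit


def coking_ratio(row: CountryYear) -> float:
    """Return ThermalCokingCapacity / AtmosphericDistillationCapacity for one row."""
    if row.distillation_capacity <= 0:
        raise ValueError("distillation capacity must be positive")
    return row.coking_capacity / row.distillation_capacity


# Made-up figures, for illustration only.
rows = [
    CountryYear("7", 2020, coking_capacity=500.0, distillation_capacity=10_000.0),
    CountryYear("3", 2020, coking_capacity=2_000.0, distillation_capacity=100_000.0),
]

ratios = {(r.country, r.year): coking_ratio(r) for r in rows}
for key, value in ratios.items():
    print(key, f"{value:.1%}")
```

In this made-up example, country "7" has the smaller coking capacity in absolute terms but the larger ratio, which is the kind of difference the normalisation is meant to reveal.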

The linear relation still holds on the chart above. We also see several clusters of refineries, including one closer to the origin with atmospheric distillation capacities of a few dozen thousand. Moreover, countries with smaller distillation capacities now appear on the right-hand side of the chart. This indicates that the bigger the coking units relative to the atmospheric distillation capacity, the greater the yield of petroleum coke as a percentage of crude throughput.

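Now that we have identified a potential linear relation in our data, we can model it using a linear regression. We will start with a simple model, where $ y $ is the share of petroleum coke output in crude throughput and $ x $ is the ratio of thermal coking capacity to atmospheric distillation capacity:

$$ y = a \cdot x + b $$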
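To make the model concrete, here is a minimal sketch of an ordinary least squares fit of $ y = a \cdot x + b $ written with the Python standard library only. The `fit_line` helper and the sample points are hypothetical; they do not reproduce the project's code, data or results, and the sketch does not compute p-values.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a*x + b. Returns (a, b, r_squared)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    a = sxy / sxx              # slope
    b = mean_y - a * mean_x    # intercept
    ss_res = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return a, b, r_squared


# Made-up observations: x = coking / distillation capacity ratio,
# y = petroleum coke output as a share of crude throughput.
x = [0.00, 0.02, 0.04, 0.06, 0.08]
y = [0.001, 0.012, 0.019, 0.032, 0.041]

a, b, r2 = fit_line(x, y)
print(f"slope={a:.3f} intercept={b:.4f} R2={r2:.3f}")
```

A statistics library would normally be used instead, since it also reports standard errors and significance tests for the coefficients.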

On our dataset, the coefficient is statistically significant, with a reported p-value of 0.000 (that is, zero to three decimal places). The R² value is 0.601, which we consider acceptable for our application.

We could continue the study by:

  • Further validating our model by verifying the assumptions of a linear regression
  • Integrating other variables in the model such as petroleum coke market prices
  • Exploring other types of model to increase accuracy (2nd order, log, etc.)
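As a rough illustration of the last point, the sketch below compares a first-order and a second-order polynomial fit on made-up data, using only the Python standard library. The `poly_fit` and `r_squared` helpers are hypothetical and written for this example. They solve the least-squares normal equations directly, which is acceptable for low degrees but not numerically robust in general.

```python
def poly_fit(x, y, degree):
    """Least-squares polynomial fit. Returns coefficients [c0, c1, ..., c_degree]."""
    m = degree + 1
    # Normal equations (X^T X) c = X^T y, with X[i][j] = x_i ** j.
    a = [[sum(xi ** (j + k) for xi in x) for k in range(m)] for j in range(m)]
    v = [sum(yi * xi ** j for xi, yi in zip(x, y)) for j in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular system: too few distinct x values")
        a[col], a[pivot] = a[pivot], a[col]
        v[col], v[pivot] = v[pivot], v[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for k in range(col, m):
                a[r][k] -= factor * a[col][k]
            v[r] -= factor * v[col]
    # Back substitution.
    coeffs = [0.0] * m
    for r in range(m - 1, -1, -1):
        tail = sum(a[r][k] * coeffs[k] for k in range(r + 1, m))
        coeffs[r] = (v[r] - tail) / a[r][r]
    return coeffs


def r_squared(x, y, coeffs):
    """Coefficient of determination of the fitted polynomial on (x, y)."""
    mean_y = sum(y) / len(y)
    pred = [sum(c * xi ** j for j, c in enumerate(coeffs)) for xi in x]
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Made-up data in percent: capacity ratio (x) and petroleum coke share (y).
x_pct = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
y_pct = [0.1, 1.0, 2.2, 3.9, 6.1, 8.8]

linear = poly_fit(x_pct, y_pct, 1)
quadratic = poly_fit(x_pct, y_pct, 2)
r2_linear = r_squared(x_pct, y_pct, linear)
r2_quadratic = r_squared(x_pct, y_pct, quadratic)
print(f"R2 linear={r2_linear:.4f}  R2 quadratic={r2_quadratic:.4f}")
```

Because the linear model is nested in the quadratic one, the in-sample R² of the second-order fit can never be lower. A fair comparison would therefore rely on held-out data or on a criterion that penalises extra parameters.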

Before concluding, we want to highlight some key limits of the study. This model scales most easily to countries with a limited number of refineries, because a mix of different refinery types can lead to substantial differences in yields at country level.

These models cannot necessarily be used to estimate exact petroleum coke output, but they can help estimate the order of magnitude to expect in a country. Moreover, we decided to express the dependent variable as a share of crude throughput because throughput can be inferred easily when data sources are lacking, as implied refinery intake = crude production + imports - exports +/- stock change.
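The balance above can be written as a small helper. This is a sketch under stated assumptions: the function names are hypothetical, all quantities are assumed to share one unit (for example KT/Y), and `stock_change` is assumed to be signed so that a stock draw is positive and a stock build is negative. The 3% share used in the example is the dataset average quoted earlier and serves only as an illustration.

```python
def implied_refinery_intake(production, imports, exports, stock_change=0.0):
    """Implied refinery intake = crude production + imports - exports + stock change.

    Assumption: stock_change is signed, positive for a stock draw (more crude
    available to refineries) and negative for a stock build.
    """
    return production + imports - exports + stock_change


def estimated_petcoke_output(intake, petcoke_share):
    """Order-of-magnitude estimate: intake multiplied by a modelled share."""
    if not 0.0 <= petcoke_share <= 1.0:
        raise ValueError("petcoke_share must be a fraction between 0 and 1")
    return intake * petcoke_share


# Made-up balance, for illustration only.
intake = implied_refinery_intake(
    production=1_000.0, imports=5_000.0, exports=500.0, stock_change=-100.0
)
output = estimated_petcoke_output(intake, petcoke_share=0.03)
print(f"implied intake={intake:.0f}  estimated petroleum coke={output:.0f}")
```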

Petroleum coke is linked to one specific unit. Estimating yields of other, high-value products such as gasoline, diesel or kerosene would require more complex models that include refinery configuration, crude density and price assumptions. A linear optimization model is one possibility, and this is something we will explore in future projects.

Finally, when deploying such models it is important to retrain them regularly, as units are commissioned or decommissioned, to account for changes in technology.

Conclusion

Linear regression is often the simplest and most effective starting point; resorting to complex statistical or machine learning models is usually overkill, as they demand heavy computation, extensive training and maintenance. Modelling output as a share of input is practical because implied refinery throughput can easily be inferred from total production, imports and exports.