So if I want to calculate life expectancy for 2022, I would
- create a newborn cohort (exact size doesn't matter here I suppose)
- calculate the risk of death for each age group, using "deaths per capita in 2022"
- then I just go through the risk array, remove a proportion of individuals from the cohort each year (risk=proportion removed?) until they're all gone.
- for each age I use the number of individuals that were removed as my weight for the respective age
- add all elements (weights applied) up
- divide by the sum of all weights
Right? I always wanted to know how this works.
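The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not anyone's actual implementation: the function name `life_expectancy`, the parameter names, and the assumption that people die on average half a year into their last age year (`mid_year=0.5`) are mine.

```python
def life_expectancy(risk_by_age, cohort_size=100_000.0, mid_year=0.5):
    """Mean age at death of a hypothetical newborn cohort.

    risk_by_age[a] is the probability of dying between age a and a + 1.
    The last entry should be 1.0 so that the cohort is fully used up.
    mid_year is an assumption: deaths are placed half a year into the age year.
    """
    alive = cohort_size
    weighted_ages = 0.0  # sum of (age at death * number of deaths)
    total_deaths = 0.0   # sum of all weights

    for age, risk in enumerate(risk_by_age):
        deaths = alive * risk              # proportion removed this year
        weighted_ages += (age + mid_year) * deaths
        total_deaths += deaths
        alive -= deaths

    # divide the weighted sum by the sum of all weights
    return weighted_ages / total_deaths
```

The cohort size cancels out in the final division, which is why the exact size doesn't matter.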
Correct. However, one step is still missing. We need the mortality risk for each age year, but we only have the values of cohorts ("Sonderauswertung Sterbefälle"). If we evaluate the cohorts as a whole, the mean age at death is not correct, because the risk of death increases with age.
The way out of this problem is to interpolate the missing values. This works pretty well under the assumption of an exponential increase.
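For illustration, an exponential increase means the logarithm of the risk grows linearly with age, so one simple way to fill in single ages is to interpolate linearly on the log scale. The sketch below assumes (my simplification, not necessarily what was done here) that each age group's rate can be pinned to the group's midpoint age; `interpolate_rates` and its parameters are hypothetical names.

```python
import math


def interpolate_rates(midpoints, group_rates, ages):
    """Log-linear interpolation of death rates to single ages.

    midpoints   : ascending midpoint ages of the age groups (assumption:
                  the group rate is treated as the rate at the midpoint)
    group_rates : death rate of each group, all > 0
    ages        : single ages for which a rate is wanted
    """
    logs = [math.log(r) for r in group_rates]
    result = []
    for age in ages:
        # pick the segment that brackets the age; ages outside the range
        # are extrapolated along the first or last segment
        i = 0
        while i < len(midpoints) - 2 and age > midpoints[i + 1]:
            i += 1
        x0, x1 = midpoints[i], midpoints[i + 1]
        t = (age - x0) / (x1 - x0)
        result.append(math.exp(logs[i] + t * (logs[i + 1] - logs[i])))
    return result
```

With rates of 0.01 at age 60 and 0.04 at age 70, this gives 0.02 at age 65 (the geometric mean) rather than the 0.025 a straight line would give.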
Yes, you mentioned that. I interpolate data all the time. Tricky for German data with those weird older cohorts though, eh? Some of them have neither linear nor exponential trends in them, but a "knick" instead - for lack of a better English word. ;)
Indeed, tricky. If you calculate annually, the "knicks" in the pop pyramid produce ugly distortions in the risk-age function that jump from year to year. So I decided to calculate on a weekly time basis. That is, estimate the population of all ages for each calendar week.
I recently did that for the US data and realized how awful those population estimates are.
https://usa.pervaers.com
I only used that site personally and changed what is being displayed a number of times. You can get an idea of how bad it is there. Anything past mid-2021 was extrapolated assuming average non-death-change of the past years for each age group, deaths were subtracted as they occurred.
John Dee recently talked about how unreliable population estimates are in one of his articles. I'm only just beginning to work with data and judging from the sloppiness I am seeing just about everywhere I am starting to understand what he's talking about.
Not from you though :)
I don't know about the US figures, what's the problem there?
In general, I tend to consistently refuse to work if the source data is garbage. That's why I don't investigate adverse event databases or so-called "COVID deaths".
As you probably know, for DE the official population stock is given for each age at each turn of the year. In between, it is a simple matter (linear interpolation over 52 weeks). The period since the last report is more difficult. The key is to use knowledge where possible, so no trend estimates from past times and such. Estimate the development from the given deaths and the laws inherent in the system.
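The "simple matter" part, between two official year-end stocks, could look like this in Python. A sketch only; `weekly_population` is a made-up name and the numbers in the usage line are placeholders, not real data.

```python
def weekly_population(stock_start, stock_end, weeks=52):
    """Linear interpolation between two year-end stocks for one age.

    Returns weeks + 1 values: index 0 is the stock at the start of the
    year, index `weeks` is the stock at the end of the year.
    """
    step = (stock_end - stock_start) / weeks
    return [stock_start + step * w for w in range(weeks + 1)]


# placeholder numbers, purely for illustration
series = weekly_population(1000, 948)
```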
We know the number of all people at a given age at the beginning of the year. Each cohort has margins where birthdays change something. -> Account for that week by week.
Every week people die from the cohort. -> Subtract them, separately for those who remain in the cohort and those who change cohorts as a result of birthdays. You estimate their risk with the given risk-age function. Again, week by week.
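To make the week-by-week bookkeeping concrete, here is a heavily simplified Python sketch of one weekly step. It is not the method described above in full: it assumes birthdays are spread evenly over the year (so about 1/52 of each age moves up per week), applies a weekly risk directly instead of distributing reported deaths, and leaves out births and migration. `step_week` and its parameters are hypothetical names.

```python
def step_week(pop, weekly_risk, weeks_per_year=52):
    """Advance a population-by-age list by one week.

    pop[a]         : number of people of age a at the start of the week
    weekly_risk[a] : assumed probability of dying within one week at age a
    The highest age is treated as open-ended (nobody leaves it by birthday).
    Births and migration are deliberately ignored in this sketch.
    """
    n = len(pop)
    # deaths first: remove the weekly share from every age
    survivors = [p * (1.0 - r) for p, r in zip(pop, weekly_risk)]

    new_pop = [0.0] * n
    for age, s in enumerate(survivors):
        # assumption: uniform birthdays, so 1/52 of the age turns one year older
        movers = s / weeks_per_year if age < n - 1 else 0.0
        new_pop[age] += s - movers
        if age < n - 1:
            new_pop[age + 1] += movers
    return new_pop
```

Calling this 52 times in a row and comparing the result with the next official year-end stock is one way to check such a scheme against past data.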
This method can be backtested with official numbers of the past. The errors are about a factor of ten below the fluctuations we are interested in. However, because this error adds up over the weeks, it averages out to only half. :-)
Yes. Backtesting. Excellent. I never even thought of it.
You're absolutely right. I should basically just simulate the process in more detail. I was mangling a huge amount of data at the time and already felt good about myself for simulating the deaths like you are suggesting, but I took a shortcut in the last year for the non-death changes, immigration, birthdays.
Immigration is a big issue. I am not sure it is handled adequately.
The USA has a gap in mortality data between two of its mortality datasets. That's probably the worst flaw I've found. It's a difference of more than 1% of excess mortality for the age group 0-64, depending on whether you look at it directly by adding up groups or subtract the 65+ group from All Cause in the Select Cause dataset. For the 0-24 age group the difference is off the charts.
Other states are okay. It's all so shady.