Which outliers to delete

Sometimes the outlier drives the relationship entirely: without it, there is no relationship between X and Y, so the regression coefficient does not truly describe the effect of X on Y. One option is to try a transformation. Square root and log transformations both pull in high values. This can make the assumptions hold better if the outlier is in the dependent variable, and can reduce the influence of a single point if the outlier is in an independent variable.
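As a quick illustration (the numbers below are invented, not taken from the article's examples), both transformations compress the top of the scale, with the log doing so more aggressively:

```python
import numpy as np

# Hypothetical response values; the last one is a high outlier.
y = np.array([2.1, 2.5, 3.0, 3.4, 19.0])

print("raw: ", y)
print("sqrt:", np.round(np.sqrt(y), 2))  # outlier pulled in moderately
print("log: ", np.round(np.log(y), 2))   # outlier pulled in more strongly
```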

Another option is to try a different model. This should be done with caution, but it may be that a non-linear model fits better. For example, in example 3, perhaps an exponential curve fits the data with the outlier intact. Whichever approach you take, you need to know your data and your research area well. Try different approaches, and see which makes theoretical sense.
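As a sketch of what trying such a model can look like, here is an exponential fit with SciPy. The data are invented to be roughly exponential just to show the mechanics, and the starting values passed as p0 are an assumption that matters for convergence:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical data that happen to follow an exponential trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.7, 7.4, 20.1, 54.6, 148.4, 403.4])

def exp_model(x, a, b):
    # y = a * exp(b * x)
    return a * np.exp(b * x)

params, _ = curve_fit(exp_model, x, y, p0=(1.0, 1.0))
print("fitted a, b:", np.round(params, 3))  # close to (1, 1) for these data
```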

Comments

Thank you for this explanation, it is really helpful. Is there an academic article or book that I can refer to when using these guidelines in my thesis?

Respected Karen, can you please add or send me the reference for this justification? Thanks in advance.

In plot number 2, I do not understand why you want to drop the outlier. To my mind, it tells you that the model is rather robust. Remember that a statistical model should only be applied for prediction within the data range used for its calibration.

The larger the data range, the more robust it will be for predicting in new situations.

When cleaning a large dataset for outliers, does a separate outlier analysis have to be run for every single regression analysis one plans on running? For instance, does running 30 different regressions typically require 30 separate outlier analyses? If so, do the outliers need to be added back into the dataset before running the next outlier analysis? If not, is a single outlier analysis enough?
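The post does not lay down a rule here, but one defensible workflow, sketched below with simulated data, is to screen each planned regression separately, always starting from the full dataset so that points dropped for one model are not silently missing from the next. The Bonferroni-adjusted studentized-residual test and the 0.05 cut-off are illustrative choices, not a prescription:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 3)))

# Three hypothetical outcome variables, one planned regression for each.
outcomes = {f"y{i}": X @ rng.normal(size=4) + rng.normal(size=100)
            for i in range(3)}

for name, y in outcomes.items():
    fit = sm.OLS(y, X).fit()
    test = fit.outlier_test()        # studentized residuals, Bonferroni p
    flagged = test["bonf(p)"] < 0.05
    print(name, "flagged rows:", list(np.flatnonzero(flagged.to_numpy())))
```

Each screen is run against the model it belongs to, and because every loop iteration uses the full data, no re-adding step is needed.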

After checking all of the above, I do not understand the rationale for keeping an outlier that affects both the assumptions and the conclusion just on principle. In a survival analysis, maybe somebody died in a car accident but you don't have the death certificate. Biomarkers can't predict that, and neither can most genes.

It is not really the outlier that anything is wrong with, but the inability of most parametric tests to deal with one or two extreme observations. If robust estimators are not available, downweighting or dropping a case that changes the entire conclusion of the model, and reporting that you did so, seems perfectly fair.
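When a robust estimator is available, the downweighting happens automatically and can be reported. A minimal sketch with statsmodels' RLM; the data and the choice of Huber loss are illustrative assumptions:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data with one extreme observation at the end.
x = np.arange(1.0, 9.0)
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.1, 20.0])
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()
rlm = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

print("OLS slope:  ", round(ols.params[1], 3))
print("Huber slope:", round(rlm.params[1], 3))
print("weight given to the extreme point:", round(rlm.weights[-1], 3))
```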

In example two, the outlier should have little effect on the slope estimate, but it ought to have a BIG effect on the standard error of the slope estimate. It would definitely be worth investigating how it came about. A lot might depend on the physical situation involved, whether we are dealing with correlation or with truly independent and dependent variables, and so on.
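That intuition is easy to check numerically. In the sketch below (invented data), an outlier placed at the centre of the x range has no leverage on the slope, so the slope barely moves while its standard error inflates:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 21)
y = 2 * x + rng.normal(scale=0.5, size=x.size)

# Add an outlier at the centre of the x range: large residual, no leverage.
x_out = np.append(x, 5.0)
y_out = np.append(y, 30.0)

for xs, ys, label in [(x, y, "clean"), (x_out, y_out, "with outlier")]:
    fit = sm.OLS(ys, sm.add_constant(xs)).fit()
    print(label, "slope:", round(fit.params[1], 3),
          "SE:", round(fit.bse[1], 3))
```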

Can we remove outliers based on the CV? To lower the CV, could one change a replicate's value without changing the treatment mean?

I tried this in one study and the effects are not trivial. First, my data had some observations which clearly were quite far from the mean (an SD of over …). I included them, and my parameters were significant all through. I am analysing household consumption expenditure, and conclusions based on outliers will most probably be unrepresentative. I tried the robust errors suggested here as well.

I think that with outliers the problem is that they inflate the variances, and hence the parameter significance, so robust errors should be enough, as far as we trust the underlying framework.
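If that is the mechanism at work, heteroskedasticity-consistent standard errors are a one-line change in statsmodels. A sketch with simulated data; HC3 is one common choice among the HC variants:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(size=200)
y[0] += 15.0                                # one extreme observation

X = sm.add_constant(x)
classic = sm.OLS(y, X).fit()                # conventional standard errors
robust = sm.OLS(y, X).fit(cov_type="HC3")   # heteroskedasticity-consistent

print("classic SE:", round(classic.bse[1], 3))
print("HC3 SE:    ", round(robust.bse[1], 3))
```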

What happens if you take out the outlier and things become more significant? What would you do in this situation? I have multivariable logistic regression results: with the outlier in the model, the p-values for age and the other predictors take one set of values; when I take out the outlier, they shift. So by taking out the outlier, two variables become less significant while one becomes more significant.
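Whichever direction the p-values move, a sensible first step is to report both fits. A sketch of that sensitivity check with statsmodels' Logit, using simulated data and a hypothetical suspect row (index 0):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
age = rng.normal(50, 10, size=n)
x2 = rng.normal(size=n)
prob = 1 / (1 + np.exp(-(0.05 * (age - 50) + 0.8 * x2)))
outcome = rng.binomial(1, prob)

X = sm.add_constant(np.column_stack([age, x2]))
full = sm.Logit(outcome, X).fit(disp=0)

keep = np.arange(n) != 0                     # drop the suspect observation
reduced = sm.Logit(outcome[keep], X[keep]).fit(disp=0)

print("p-values with the point:   ", np.round(full.pvalues, 3))
print("p-values without the point:", np.round(reduced.pvalues, 3))
```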

I used a square root to transform the IV weight. If I use this variable, the R² of my model decreases.

So first of all, what is an outlier? Of course, there will always be some points that are further away from the mean than the others.

Even so, they are not outliers unless they are very far away from the rest of the data; just being a little bit further is not enough. It is clear that having numerous outliers in the data, say one in every dose group, is undesirable. However, what is not so obvious is that even one outlier can potentially cause problems. For example, looking at the figure above, at first glance everything looks pretty clear (even if the points had not been colour-coded): there is a linear dose-response relationship which runs through all of the points except the outlier… and then there is the outlier doing its own thing.

The implications of this one outlier for the overall model are seen when you try to fit a model to the data: the best fit will try to compromise. It will run close to the majority of the data, but it will also try to take the outlier into account, so it ends up landing somewhere in between. This is shown in the figure below, where the best fit including all the data is the red line.

It passes above most of the data at the low doses because of the high outlier at the bottom dose. For comparison, removing the outlier entirely and fitting the remaining data gives the green line.
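A sketch of the same comparison in code, with invented numbers shaped like the figure (the first replicate at the lowest dose plays the outlier):

```python
import numpy as np

# Hypothetical dose-response data; the ~7 response at the lowest dose
# is the outlier.
log_dose = np.array([0, 0, 1, 1, 2, 2, 3, 3], dtype=float)
resp = np.array([7.0, 1.2, 2.1, 1.9, 3.0, 3.2, 4.1, 3.9])

red = np.polyfit(log_dose, resp, 1)            # all data ("red line")
green = np.polyfit(log_dose[1:], resp[1:], 1)  # outlier removed ("green line")

print("slope with outlier:   ", round(red[0], 3))    # shallower
print("slope without outlier:", round(green[0], 3))  # steeper
```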

There is clearly quite a large difference between the fits — in particular, the red line has a smaller slope than the green line. In a bioassay, this could mean that the assay fails system suitability criteria, or if it passes, the relative potency estimate could be inaccurate.

Given the problems outliers cause, the simplest solution appears to be simply to remove them. However, this is not as simple as it sounds, because it is not always obvious from the data what is an outlier and what is not. In the figures above, the point at the lowest dose with a very high response (around 7) is clearly an outlier.

If the response were much lower, the point would clearly belong with the rest of the data. But in between there is a range of values, around 4 or 5, where it might not be as clear whether this is an outlier or just a slightly high response. Making a manual judgement in such cases would be difficult, and manual judgements are subjective in any case: different people might make different decisions on whether a particular point is an outlier or not.

For consistency, and for regulatory reasons, it is better to use an automatic method of detecting outliers. Two such methods are commonly available in software packages used for bioassay analysis, including our commercially available software, QuBAS. The adjustment step is a bit different in the two methods, which can lead to different points being identified as outliers in each case.
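The post does not spell out which two tests are meant, but as a generic illustration of an automatic screen (not necessarily what QuBAS implements), points can be flagged by Bonferroni-adjusted studentized residuals:

```python
import numpy as np
import statsmodels.api as sm

# Same hypothetical data as above; the ~7 response is the suspect point.
log_dose = np.array([0, 0, 1, 1, 2, 2, 3, 3], dtype=float)
resp = np.array([7.0, 1.2, 2.1, 1.9, 3.0, 3.2, 4.1, 3.9])

fit = sm.OLS(resp, sm.add_constant(log_dose)).fit()
test = fit.outlier_test()          # studentized residuals + Bonferroni p
flagged = test["bonf(p)"] < 0.05

print(test.round(3))
print("flagged rows:", list(np.flatnonzero(flagged.to_numpy())))
```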

A completely different way of dealing with outliers is to try transforming the data. Sometimes a response that looks like an outlier can actually be due to increased variation in one part of the dose-response curve.

For example, in the figure below, the responses at the lowest dose are very far apart, so it looks like one of them is an outlier, although there is no way of telling which one. After transforming the data, though, the bottom dose group no longer looks unusual: the responses are no further apart than at several of the other doses. So there are actually no outliers here at all!
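A tiny numeric sketch of why: if the spread of replicate pairs is proportional to their mean (invented numbers below), the widest pair stands out on the raw scale, but all pairs have the same spread after a log transformation:

```python
import numpy as np

# Hypothetical replicate pairs; every pair has the same high/low ratio (2.5).
pairs = {"lowest dose": (40.0, 100.0), "dose 2": (16.0, 40.0),
         "dose 3": (6.4, 16.0), "dose 4": (2.6, 6.5)}

for dose, (a, b) in pairs.items():
    print(f"{dose:12s} raw spread: {b - a:5.1f}   "
          f"log spread: {np.log(b) - np.log(a):.2f}")
```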

For more about transforming your data, see our previous blog on the topic.


