Calculating the new standard deviation from the old standard deviation after a change in the data set

I have an array of $n$ real values, which has mean $\mu_{old}$ and standard deviation $\sigma_{old}$. If an element $x_i$ of the array is replaced by another element $x_j$, then the new mean will be

$$\mu_{new} = \mu_{old} + \frac{x_j - x_i}{n}$$

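The advantage of this approach is that it requires a constant amount of computation regardless of the value of $n$. Is there an approach for calculating $\sigma_{new}$ from $\sigma_{old}$, like the calculation of $\mu_{new}$ from $\mu_{old}$?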




Answer

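A section in the Wikipedia article on "Algorithms for calculating variance" shows how to compute the variance when elements are added to your observations. (Note that the standard deviation is the square root of the variance.) Assume that you append $x_{n+1}$ to your array; then

$$\sigma_{new}^2 = \sigma_{old}^2 + (x_{n+1} - \mu_{new})(x_{n+1} - \mu_{old}).$$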

EDIT: Above formula seems to be wrong, see comment.

Now, replacing an element means adding an observation and removing another one; both can be computed with the formula above. However, keep in mind that problems of numerical stability may ensue; the quoted article also proposes numerically stable variants.
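As a rough illustration of the add-then-remove idea, here is a minimal Python sketch. It assumes the update from the Wikipedia article is applied to the sum of squared deviations $M_2 = (n-1)\sigma^2$ rather than to the variance itself, which is the form in which the identity holds. The function names `add_value`, `remove_value` and `replace_value` are made up for this example, and the removal step assumes at least two values remain.

```python
import math
import statistics


def add_value(n, mean, m2, x):
    """Welford-style update after appending x (m2 = sum of squared deviations)."""
    n_new = n + 1
    mean_new = mean + (x - mean) / n_new
    m2_new = m2 + (x - mean) * (x - mean_new)
    return n_new, mean_new, m2_new


def remove_value(n, mean, m2, x):
    """Inverse of add_value: drop x from a set of n values (n must be >= 2)."""
    n_new = n - 1
    mean_new = (n * mean - x) / n_new
    m2_new = m2 - (x - mean) * (x - mean_new)
    return n_new, mean_new, m2_new


def replace_value(n, mean, std, x_old, x_new):
    """Replace x_old by x_new; returns (new mean, new sample standard deviation)."""
    m2 = std ** 2 * (n - 1)                 # recover the sum of squared deviations
    state = add_value(n, mean, m2, x_new)   # first add the new observation
    n2, mean2, m2_2 = remove_value(*state, x_old)  # then remove the old one
    # Clamp at zero to guard against tiny negative values from rounding.
    return mean2, math.sqrt(max(m2_2, 0.0) / (n2 - 1))


# Compare against a full recomputation on a small example.
data = [2.0, 4.0, 4.0, 5.0]
new_mean, new_std = replace_value(
    len(data), statistics.mean(data), statistics.stdev(data), x_old=5.0, x_new=9.0
)
expected = [2.0, 4.0, 4.0, 9.0]
print(new_mean, statistics.mean(expected))
print(new_std, statistics.stdev(expected))
```

The removal step subtracts two quantities of similar size, so this is the part where the numerical stability concerns are most likely to show up.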

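To derive the formula by yourself, compute $(n-1)(\sigma_{new}^2 - \sigma_{old}^2)$ using the definition of sample variance and substitute $\mu_{new}$ by the formula you gave when appropriate. This gives you $\sigma_{new}^2 - \sigma_{old}^2$ in the end, and thus a formula for $\sigma_{new}$ given $\sigma_{old}$ and $\mu_{old}$. In my notation, I assume you replace the element $x_n$ by $x_n'$:

$$
\begin{aligned}
\sigma^2 &= (n-1)^{-1} \sum_k (x_k - \mu)^2 \\
(n-1)(\sigma_{new}^2 - \sigma_{old}^2) &= \sum_{k=1}^{n-1} \left( (x_k - \mu_{new})^2 - (x_k - \mu_{old})^2 \right) + \left( (x_n' - \mu_{new})^2 - (x_n - \mu_{old})^2 \right) \\
&= \sum_{k=1}^{n-1} \left( (x_k - \mu_{old} - n^{-1}(x_n' - x_n))^2 - (x_k - \mu_{old})^2 \right) + \left( (x_n' - \mu_{old} - n^{-1}(x_n' - x_n))^2 - (x_n - \mu_{old})^2 \right)
\end{aligned}
$$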

The $x_k$ in the sum transform into something dependent on $\mu_{old}$, but you'll have to work the equation a little bit more to derive a neat result. This should give you the general idea.
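For reference, here is one way the expansion can be carried through. It writes $x_n'$ for the replacement value, introduces the shorthand $\delta = x_n' - x_n$ (not part of the answer's notation), and uses the fact that $\sum_{k=1}^{n}(x_k - \mu_{old}) = 0$, so that $\sum_{k=1}^{n-1}(x_k - \mu_{old}) = -(x_n - \mu_{old})$:

$$
\begin{aligned}
\sum_{k=1}^{n-1} \left( (x_k - \mu_{old} - \tfrac{\delta}{n})^2 - (x_k - \mu_{old})^2 \right) &= \frac{2\delta}{n}(x_n - \mu_{old}) + \frac{n-1}{n^2}\,\delta^2 \\
(x_n' - \mu_{old} - \tfrac{\delta}{n})^2 - (x_n - \mu_{old})^2 &= \frac{2(n-1)\delta}{n}(x_n - \mu_{old}) + \frac{(n-1)^2}{n^2}\,\delta^2 \\
(n-1)(\sigma_{new}^2 - \sigma_{old}^2) &= 2\delta\,(x_n - \mu_{old}) + \frac{n-1}{n}\,\delta^2 \\
&= (x_n' - x_n)\left( (x_n' - \mu_{new}) + (x_n - \mu_{old}) \right)
\end{aligned}
$$

The last line follows from $\mu_{new} = \mu_{old} + \delta/n$, and it only involves the old and new values, the two means and $n$, so the update takes constant time.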


Answer

Based on what I think I'm reading in the linked Wikipedia article, you can maintain a "running" standard deviation:

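real sum = 0;    // running sum of the values seen so far
int count = 0;   // number of values seen so far
real S = 0;      // running sum of squared deviations from the mean

real GetRunningStandardDeviation(ref real sum, ref int count, ref real S, real x)
{
   if (count >= 1)
   {
       real oldMean = sum / count;
       sum = sum + x;
       count = count + 1;
       real newMean = sum / count;

       // update the sum of squared deviations using the old and new means
       S = S + (x-oldMean)*(x-newMean);
   }
   else
   {
       // first value: the mean is x itself, so there is no deviation yet
       sum = x;
       count = 1;
       S = 0;
   }

   //estimated Variance = (S / (count-1) )
   //estimated Standard Deviation = sqrt(variance)
   if (count > 1)
      return sqrt(S / (count-1) );
   else
      return 0;
}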

The article does not maintain a separate running sum and count, however; it keeps a single running mean instead. Since I already keep a count (for statistical purposes) in what I'm doing today, it is more useful for me to calculate the means each time.


Answer

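Given original $\bar{x}$, $s$, and $n$, as well as the change of a given element $x_n$ to $x_n'$, I believe your new standard deviation $s'$ will be the square root of

$$s^2 + \frac{1}{n-1}\left( 2n\,\Delta\bar{x}\,(x_n - \bar{x}) + n(n-1)(\Delta\bar{x})^2 \right),$$

where $\Delta\bar{x} = \bar{x}' - \bar{x}$, with $\bar{x}'$ denoting the new mean.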

Maybe there is a snazzier way of writing it?

I checked this against a small test case and it seemed to work.
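A check of this kind might look like the following Python sketch. The function name `replaced_std` and the sample data are made up for illustration; the result is compared against a full recomputation with the standard library's `statistics.stdev`, which uses the same $n-1$ denominator.

```python
import math
import statistics


def replaced_std(n, mean, std, x_old, x_new):
    """Return (new mean, new sample std) after replacing x_old with x_new."""
    delta = (x_new - x_old) / n  # change in the mean
    var_new = std ** 2 + (
        2 * n * delta * (x_old - mean) + n * (n - 1) * delta ** 2
    ) / (n - 1)
    # Clamp at zero to guard against tiny negative values from rounding.
    return mean + delta, math.sqrt(max(var_new, 0.0))


values = [1.0, 3.0, 6.0, 10.0]
mean_after, std_after = replaced_std(
    len(values), statistics.mean(values), statistics.stdev(values),
    x_old=6.0, x_new=2.0,
)

# Direct recomputation on the modified data for comparison.
modified = [1.0, 3.0, 2.0, 10.0]
print(mean_after, statistics.mean(modified))
print(std_after, statistics.stdev(modified))
```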


Answer