Correct pool model & decent statistical predictions for the next draws

Michau · Mar 28, 2015

OK, now that six draws have taken place, we can build a decent statistical model of the pool based on the Express Entry spreadsheet.

We know (just as I predicted here: http://www.canadavisa.com/canada-immigration-discussion-board/a-possibly-plausible-theory-about-the-number-of-people-in-the-pool-and-scores-t271152.0.html) that the first draws had an unusual number of 600+ applicants due to errors, and this is now confirmed by many people who had 600 points added by CIC's mistake. So the results of the first draws are irrelevant and we should not look at them.

The last two draws give us much better information, and we can see from them that the current EE spreadsheet contains about 4-5% of the pool, just like the previous year's spreadsheet for FSW (nobody should be really surprised by that).

So let's build a statistical model based on the current Excel contents. We need the following parameters:

* The ratio of the spreadsheet population to the real pool population; let's assume it is between 4.0% and 5.5%.

* The distribution of people's scores. Take the list of all spreadsheet scores below 600; it gives us a good score distribution among applicants, not a perfect one, but definitely good enough. Then there is the +600 factor, which we account for by introducing a probability of someone having an LMIA or PNP nomination. In the spreadsheet, 48 people out of 974 have more than 600 points (4.9%), so we assume this probability is around 5%.

* The number of people in the pool on day 0 (which is today). The spreadsheet today contains 813 people without an ITA (that is, below 453 points).

* The number of people added to the spreadsheet daily. From March 1 to March 28 there were 174 new entries in the spreadsheet, which gives an average of 6.2 new entries per day.

* The dates and counts of future draws. Nobody knows these of course, but let's assume 2 draws per month, 1600 people each - this looks like a good assumption based on the draws done in March, and also based on immigration targets for 2015.
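As a rough check of the arithmetic behind these parameters, here is a minimal Python sketch (Python is used here only for illustration; the author's model is in PHP). The numbers come from the post; the helper name `scale_to_pool` and the idea of simply dividing by the assumed share are illustrative assumptions, not taken from the pastebin code.

```python
# Figures quoted in the post (spreadsheet snapshot of Mar 28, 2015).
ABOVE_600 = 48        # spreadsheet entries with more than 600 points
TOTAL_ENTRIES = 974   # all spreadsheet entries
WAITING = 813         # entries still without an ITA (below 453 points)
NEW_ENTRIES = 174     # entries added from March 1 to March 28
DAYS = 28

lmia_prob = ABOVE_600 / TOTAL_ENTRIES   # about 0.049, rounded to 5% in the post
daily_entries = NEW_ENTRIES / DAYS      # about 6.2 per day


def scale_to_pool(spreadsheet_value, share):
    """Scale a spreadsheet figure up to the real pool (hypothetical helper).

    Assumes the spreadsheet is a fixed fraction `share` of the real pool.
    """
    return spreadsheet_value / share


for share in (0.055, 0.050, 0.045, 0.040):
    pool = scale_to_pool(WAITING, share)
    inflow = scale_to_pool(daily_entries, share)
    print(f"share {share:.1%}: pool of about {pool:,.0f}, "
          f"about {inflow:.0f} new profiles per day")
```

Under the 5.0% assumption this puts the waiting pool at roughly 16,000 people with a bit over 120 new profiles per day, which is the scale the draws of 1600 have to be compared against.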

Those are all the assumptions we need. They are not very accurate yet, but at this point they should be accurate enough to build a simple model. Here is the actual parametrized model (sorry, it's PHP code; I'm a Web designer):

http://pastebin.com/wEHvGwyB

What does the model tell us? Based on the assumed relation of the spreadsheet population to the real pool, we get the following results:

If the spreadsheet represents 5.5% of the pool:

April 1st draw: cutoff score = 437
April 2nd draw: cutoff score = 427
May 1st draw: cutoff score = 415
May 2nd draw: cutoff score = 409
June 1st draw: cutoff score = 400
June 2nd draw: cutoff score = 394
July 1st draw: cutoff score = 388
July 2nd draw: cutoff score = 384

If the spreadsheet represents 5.0% of the pool:

April 1st draw: cutoff score = 438
April 2nd draw: cutoff score = 429
May 1st draw: cutoff score = 419
May 2nd draw: cutoff score = 413
June 1st draw: cutoff score = 408
June 2nd draw: cutoff score = 401
July 1st draw: cutoff score = 396
July 2nd draw: cutoff score = 392

If the spreadsheet represents 4.5% of the pool:

April 1st draw: cutoff score = 439
April 2nd draw: cutoff score = 432
May 1st draw: cutoff score = 423
May 2nd draw: cutoff score = 417
June 1st draw: cutoff score = 412
June 2nd draw: cutoff score = 409
July 1st draw: cutoff score = 403
July 2nd draw: cutoff score = 399

If the spreadsheet represents 4.0% of the pool:

April 1st draw: cutoff score = 441
April 2nd draw: cutoff score = 435
May 1st draw: cutoff score = 428
May 2nd draw: cutoff score = 421
June 1st draw: cutoff score = 416
June 2nd draw: cutoff score = 413
July 1st draw: cutoff score = 411
July 2nd draw: cutoff score = 407

Personally I am leaning towards 4.5% - 5.0%. So that means that we should reach 400 around June/July.
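To see how far the four scenarios drift apart, the cutoffs listed above can be compared draw by draw. This is only a small Python sketch over the published numbers; the function names are hypothetical and nothing here re-runs the model.

```python
# Cutoff predictions copied from the post, keyed by assumed spreadsheet share.
DRAWS = ["April 1st", "April 2nd", "May 1st", "May 2nd",
         "June 1st", "June 2nd", "July 1st", "July 2nd"]
CUTOFFS = {
    "5.5%": [437, 427, 415, 409, 400, 394, 388, 384],
    "5.0%": [438, 429, 419, 413, 408, 401, 396, 392],
    "4.5%": [439, 432, 423, 417, 412, 409, 403, 399],
    "4.0%": [441, 435, 428, 421, 416, 413, 411, 407],
}


def spread_per_draw(cutoffs):
    """Return (lowest, highest) predicted cutoff for each draw across scenarios."""
    return [(min(column), max(column)) for column in zip(*cutoffs.values())]


def first_draw_at_or_below(scores, target):
    """Index of the first draw whose cutoff is <= target, or None if never."""
    for index, score in enumerate(scores):
        if score <= target:
            return index
    return None


for name, (low, high) in zip(DRAWS, spread_per_draw(CUTOFFS)):
    print(f"{name}: {low}-{high} ({high - low} points, {(high - low) / low:.1%})")

for share, scores in CUTOFFS.items():
    index = first_draw_at_or_below(scores, 400)
    print(share, "reaches 400 at:", DRAWS[index] if index is not None else "not by July")
```

The scenarios differ by only 4 points for the first April draw but by 23 points for the July draws, so the later predictions are much more sensitive to the assumed share.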

Another interesting value is the level at which the draw score finally stabilizes. This depends heavily on the ratio of new people entering the pool to the number of people drawn, and the first of these is derived from the spreadsheet, so it is not very accurate either. But, just to give you a ballpark number: with the 5.0% spreadsheet-to-pool assumption, after two years (!) the model reaches somewhere around 360, and with the 4.5% assumption it reaches about 365. The actual stabilized score may be lower if fewer new people join the pool over time, or if more people are drawn than the 3200-3300 per month we see now. Generally, looking at these first approximations, I would say that people above 360 should eventually get their ITAs, though the wait may be really long.
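A rough back-of-envelope reading of the stabilization point, sketched in Python rather than taken from the pastebin code: once the backlog is gone, draws can only invite as many people as arrive, so the cutoff settles where the share of newcomers scoring above it equals outflow divided by inflow. The sketch assumes a constant inflow, nobody leaving the pool except through a draw, and draws of 1600 every 16 days (the spacing the author uses in the algorithm description later in the thread).

```python
DAILY_SPREADSHEET_ENTRIES = 6.2   # March average from the post
DRAW_SIZE = 1600                  # people per draw, from the post
DRAW_INTERVAL_DAYS = 16           # draw spacing from the author's algorithm description


def steady_state_invited_fraction(share):
    """Fraction of newcomers that can be invited in the long run.

    Assumption: constant inflow and no one leaves the pool except via a draw.
    """
    inflow_per_day = DAILY_SPREADSHEET_ENTRIES / share
    outflow_per_day = DRAW_SIZE / DRAW_INTERVAL_DAYS
    return min(1.0, outflow_per_day / inflow_per_day)


for share in (0.050, 0.045):
    fraction = steady_state_invited_fraction(share)
    print(f"share {share:.1%}: about {fraction:.0%} of newcomers invited, "
          f"cutoff near the {1 - fraction:.0%} quantile of newcomer scores")
```

Under these assumptions roughly 81% of newcomers are eventually invited at a 5.0% share and roughly 73% at 4.5%, which is why the lower share gives a somewhat higher stabilized score.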

EEBANKER · Mar 28, 2015

Good assumptions

Journey2PR · Mar 28, 2015

Hope it's not true, as I don't fit anywhere with 362 points... lol

Michau · Mar 28, 2015

Journey2PR said:
Hope it's not true, as I don't fit anywhere with 362 points... lol

Well, as I said, the model is far too inaccurate for that so far.

It depends heavily on the number of new people entering the pool every month, and this number may decline in the future: EE is a new system, so in these first months everyone is rushing into the pool. Have hope with 362.

visaowl · Mar 28, 2015

How come your statistical prediction does not have any error limits? The model can only be judged once the error limits are known. Besides, how well do you think the predictions will hold, given that the model is based on about 5% of the population, a sample that can itself be error-prone? Good work though; it would be good if you open-sourced it. (Oops: I did not look at the pastebin link, the model is there.)

Comments: How well do you think this line holds: if(rand(0, 999)/1000 <= $LMIA_PROB) $score += 600;

I cannot understand PHP; could you post the algorithm for this one? For sure, the model is a good initial assumption and can be improved now that scores have dropped below 600. It will be a good model for finding the band in which draw scores will settle.

Michau · Mar 28, 2015

It is open sourced - the original post contains a pastebin link where you can get the source code.

The 5% assumption will be refined as we see the next draws, but currently it looks like a good approximation. I don't see it suddenly going to 2% or 10%; it will hover somewhere around the 5% we have now. Another important factor is the number of people added daily (6.2 in March), which can also change in subsequent months.

There are no error margins given because the model is based on simulation rather than probability calculations (those would be really hard). You set the parameters, run the simulation and see the results. Different parameters give you different results. As you can see, changing the representation factor from 4.0% to 5.5% changes the draw predictions by at most about 6% (384 vs. 407 for the second July draw) - I would say that is not bad.

The $LMIA_PROB you are asking about is the probability of someone having 600 points added for an LMIA or PNP, which I inferred from the spreadsheet to be about 5%. Again, this is a factor which can of course change in the future.
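On the question of how well the quoted PHP line holds: PHP's rand(0, 999) returns an integer from 0 to 999 inclusive, so with `<=` and a probability of 0.05 the condition is true for 51 of the 1000 possible values (0 through 50), i.e. 5.1% rather than 5.0%. That is far smaller than the uncertainty in the 5% estimate itself. The Python sketch below just counts the cases and shows a strict-comparison alternative; the function names are hypothetical.

```python
import random

LMIA_PROB = 0.05  # value inferred from the spreadsheet in the post


def hit_rate_inclusive(prob, steps=1000):
    """Exact hit rate of `k / steps <= prob` for k uniform on 0..steps-1.

    Mirrors the comparison in the quoted PHP line.
    """
    return sum(1 for k in range(steps) if k / steps <= prob) / steps


def has_bonus(prob, rng):
    """Bernoulli trial using a strict comparison on a float in [0, 1)."""
    return rng.random() < prob


print("inclusive comparison:", hit_rate_inclusive(LMIA_PROB))

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 100_000
hits = sum(has_bonus(LMIA_PROB, rng) for _ in range(trials))
print("strict comparison, simulated:", hits / trials)
```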

Generally the algorithm is really simple:

1. Get the score distribution from the spreadsheet. Create an initial pool of participants with this score distribution, but reject anyone with 453 points or more because they have already received ITAs. This is the snapshot of the pool as we have it today.

2. Every day, add X people drawn from the score distribution (X depends on the assumed spreadsheet-to-pool ratio, of course).

3. Every 16 days simulate a draw by sorting the pool and taking out the best 1600 people.

That's the whole algorithm and it's really straightforward. The parameters are what make the difference.
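For readers who do not read PHP, the three steps above can be sketched in Python roughly as follows. This is not a port of the pastebin code: the function and parameter names are made up for illustration, the way the initial pool is sampled is an assumption, and `placeholder_scores` is random stand-in data rather than the real spreadsheet, so the printed cutoffs are not comparable to the predictions in the first post.

```python
import random


def simulate(base_scores, share, days, *, waiting=813, daily_entries=6.2,
             lmia_prob=0.05, draw_size=1600, draw_every=16,
             ita_cutoff=453, seed=0):
    """Return the cutoff score of each simulated draw.

    base_scores: spreadsheet scores below 600 (the score distribution).
    share: assumed spreadsheet-to-pool ratio, e.g. 0.05.
    """
    rng = random.Random(seed)

    def newcomer():
        # Sample a score from the distribution, then maybe add the LMIA/PNP bonus.
        score = rng.choice(base_scores)
        if rng.random() < lmia_prob:
            score += 600
        return score

    # Step 1: initial pool, excluding anyone at or above the last cutoff.
    eligible = [s for s in base_scores if s < ita_cutoff]
    pool = [rng.choice(eligible) for _ in range(round(waiting / share))]

    per_day = round(daily_entries / share)
    cutoffs = []
    for day in range(1, days + 1):
        # Step 2: daily arrivals, scaled up from the spreadsheet rate.
        pool.extend(newcomer() for _ in range(per_day))
        # Step 3: every `draw_every` days, invite the top `draw_size` people.
        if day % draw_every == 0:
            pool.sort(reverse=True)
            invited, pool = pool[:draw_size], pool[draw_size:]
            if invited:
                cutoffs.append(invited[-1])
    return cutoffs


# Stand-in data only: uniform random scores, NOT the spreadsheet contents.
_placeholder_rng = random.Random(1)
placeholder_scores = [_placeholder_rng.randint(250, 500) for _ in range(900)]

print(simulate(placeholder_scores, share=0.05, days=64))
```

Replacing `placeholder_scores` with the real list of spreadsheet scores and running the function for several values of `share` would give one cutoff list per assumed ratio.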

visaowl · Mar 28, 2015

Michau said:
It is open sourced - the original post contains a pastebin link where you can get the source code.

The 5% assumption will be refined as we see the next draws, but currently it looks like a good approximation. I don't see it suddenly going to 2% or 10%; it will hover somewhere around the 5% we have now. Another important factor is the number of people added daily (6.2 in March), which can also change in subsequent months.

There are no error margins given because the model is based on simulation rather than probability calculations (those would be really hard). You set the parameters, run the simulation and see the results. Different parameters give you different results. As you can see, changing the representation factor from 4.0% to 5.5% changes the draw predictions by at most about 6% (384 vs. 407 for the second July draw) - I would say that is not bad.

I think that is how error limits can be calculated: by varying the input population and then refining it over the subsequent months as we get more data points from CIC, rather than keeping it constant. It is hard for me to understand PHP; could you post the algorithm, please?

Would love to work this into something more refined (a service, maybe) with you personally.

Michau · Mar 28, 2015

I have edited my post above to show the algorithm. It's something really simple, hard to even call it an "algorithm" so far. It just simulates new people entering the pool with the correct score distribution, and then CIC taking the best people out with a draw.

atmtaatmta · Mar 29, 2015

This topic should be made sticky.
Best model by far. Much better than these 'guessing points for next draw' topics.

andycan · Mar 29, 2015

atmtaatmta said:
This topic should be made sticky.
Best model by far. Much better than these 'guessing points for next draw' topics.

I second that.

pakimom · Mar 29, 2015

I think this analysis is brilliant! It's a far more educated approach to how we should expect the scores to move in the next few months.

+1 to you!

andycan · Mar 29, 2015

+1 for you

Mrnassaro · Mar 29, 2015

So based on these assumptions, it might reach 350-380 by August?

Michau · Mar 29, 2015

It's quite a stretch to answer that, because the model is iterative, so it gets more and more inaccurate with every subsequent prediction. But based on the limited data we have so far, I would say: 350 is rather unlikely this year, 380 is quite possible in August or September. Remember that the difference between 350 and 380 is huge.

andy000 · Mar 29, 2015

Michau said:
It's quite a stretch to answer that, because the model is iterative, so it gets more and more inaccurate with every subsequent prediction. But based on the limited data we have so far, I would say: 350 is rather unlikely this year, 380 is quite possible in August or September. Remember that the difference between 350 and 380 is huge.

I'm under 383 at the moment, going for another IELTS to reach CLB 9, which will hopefully take me straight to 470 in May. So far, one of the best models I've seen about the draws, keep 'em coming!
