CN Resource Requirements
-------------------------------------------

This document provides an overview of estimates of the resources needed to run a CN node. It starts with an overview of the current resource baseline based on the cn-ucsb-1 processes.  We need to add a more realistic estimate under load.


Current minimum baseline
-----------------------------------------
The development CN instances provide an absolute minimum baseline because they show that majority of processes but with minimal data and no load.  Here's the baseline:

Storage

14GB used out of 569G disk available

Memory

Total: 6GB used out of 31GB total on machine, + 1 GB used for swap

Memory by major processes:

       Process   VirtualMem (MB)   ResidentMem (MB)       
        Apache                     1284                            30       
        Tomcat                   10481                         5472       
        MySQL                       261                            53       
        Postgres                     937                            74       
        Total                       12963                         5629  

Tomcat memory settings: 

    -XX:MaxPermSize=128m -Xms8192m -Xmx8192m
    

KNB Resource usage (for comparison)
-----------------------------------------------------------

Storage:
Metacat XML Documents
933MB on filesystem for 41482 XML documents, averaging 23Kb/document

Metacat data storage:
205GB for 33,466 data objects, averaging 6.1MB/object

Postgres DB size
56GB for 41482 documents + 33466 data objects, averaging 783Kb/object

Memory:
Total: 3GB used out of 15GB on machine, + 3GB used for swap
    
Memory by major processes:

          Process   VirtualMem (MB)   ResidentMem (MB)       
          Apache                       1129                             30       
          Tomcat                     21094                          2969       
          MySQL                            0                               0       
          Postgres                     6859                           542       
          Total                         29082                          3541  


Crude Projection
---------------------------
At these rates, if we project storage linearly out to 100K, 500K, and 1000K documents, and we project memory out as a function ofour needs will be:

                       41K   100K    500K   1000K       
Storage (GB)     262     632    3158     6316       
Memory (GB)      28       68      342       685  

Why is Tomcat taking 6GB of memory on the development server? Does this have anything to do with the differences in software stacks being run? 3GB of memory difference seems huge.
    Yes, probably -- in addition to Metacat, the CN runs mercury under tomcat, as well as the CN servlets.  More servlets = more memory.

Also, even though the Postgres database is much larger in the case of KNB, the difference in the active Postgres process related memory (ResidentMem (MB)) between the two is only 400mb which is small in comparison with the memory usage by Tomcat.
    Yep, and I suspect the postgres memory usage will not scale with db size, except maybe for indices.  Mainly it will scale with the # of pg processes and size of the global area.

One more point, these numbers seem to be calculated using only VirtualMem (that's how I got the same numbers in the projection for Memory). If this is so, we won't have a good idea of the actual memory footprint of the service as a whole as this is swap and not actually in the RAM. 

In the process table, VIRT = RES + SWAP.  I reported VIRT and RES.  Resident memory usage (RES) is important because it shows what is in memory now.  Virtual memory usage is important because it represents the size of the whole image, and so may be in physical RAM at some points in time when swapped in.  I omitted SWAP because it can be calculated and doesn't tell us any more.

Postgres difference:        542-74      =     468 MB
Tomcat difference:          2969-5472 =  -2503 MB

This seems to suggest that the difference in deployed tomcat services between DataONE and KNB will have more of an effect on RAM footprint than the first 200K documents.
Yeah. I think a lot of the difference for Tomcat is that the CN also runs mercury in tomcat, which has a lot of memory for indices and SOLR, etc.

(Postgres difference)/41K files * 200K files    468/41*200=2282 MB

I don't expect PG memory to scale linearly with database size (see above). 

Of course this does have quite an effect on the swap space used on the disk...

Postgres difference:          6859-937        =       5922 MB
Tomcat difference:            21094-10481   =     10613 MB

For the same 200K documents, you might expect to use 

(Postgres difference)/41K files * 200K files    5922/41*200=29 GB for Postgres

That all assumes linear memory scaling, which I don't think will happen. 

I can't think of a reason the Tomcat service's needs would grow with the # of documents, but if it did... scratch that, I think I can as more and more of the database was being accessed by users...

(Tomcat difference)/41K files * 200K files        10613/41*200= 52 GB for Tomcat on disk. 

And as we approach 1e6 documents....

Postgres mem: 468/41*1000      =    11.4 GB
Postgres swap: 5922/41*1000    =     144 GB
can't estimate Tomcat's memory increase due to the difference between KNB and D1CN,
Tomcat swap: 10613/41*1000    =     258 GB