I will respond as well, since my ideas differ a bit from Chad's in terms of
implementation. Thanks for the conversation, by the way; it's always good to
have peer review.
1) Better performance is NOT the main goal; the real goal is to improve
reliability and security without reducing performance. That said, I believe we
will achieve better performance (especially through better parallelism and
fast IPC), which will be great for marketing, but it doesn't matter if we
don't, as long as we are close. Note Singularity has the same goals; I just
rank performance higher than they do. For example, they showed a 10%
improvement when services called the device drivers directly (a direct call)
rather than going through IPC; they wore this IPC cost as acceptable, whereas
I probably will not.
2) Remember when operating systems went from real mode to protected mode
(e.g. DOS to Windows/NT) and how much performance we lost? All the games and
some CAD systems stayed with real mode for performance. Why did we break with
real mode? 1. Not enough memory (needed by GUI apps). 2. Reliability for
multiple-process support. Type-safe managed applications can provide the
reliability, and the memory issue is no longer a big one, as it was mainly a
DOS/16-bit real mode problem. Note you can even go 64-bit real mode.
3) Cosmos is a kit: we will be able to chop and change major parts of the OS
easily, and so will anyone else. For example, the whole MM I have implemented
so far is replaceable, could be restarted, and could even sit on a different
machine (though that's not a good idea, given network problems). With all the
kernel code being OO C#, someone could simply replace my software MM with a
hardware VMM. This is only barely possible with Linux, and it would be a major
effort, as Linux is mature (bloated) and lacks clearly defined interfaces.
MMU/TLB vs. software paging: what ideas do you have for implementing paging
in software that are better than paging in hardware? If code or data pages
are removed, previously compiled code would fail when it tried to access
invalid pointers without MMU faults.
Memory from the software paging system is only allocated in large blocks to
trusted components; so far the only things that request memory are the typed
code library loader, the allocators (GC), the shared memory allocator, and
DMA. It is not possible to have an invalid pointer.
For data, programs can't create or modify pointers (except for setting one to
null); only the GC can create a pointer. The only form of corruption possible
is that a user could set a pointer to another object of the same type that he
manages; however, he only has access to objects created by that process.
For code, C-style function pointers are not supported. The IL library carries
the names of the methods it wants to call, and the OS (dynamic loader)
replaces each name with the appropriate address, if it is valid, when it
loads. Self-modifying code and reflection are not supported.
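To make that concrete, here is a minimal sketch of load-time binding; the
types and names are illustrative, not the actual Cosmos loader. Each call
site is resolved against a table of verified entry points, and loading fails
if the target does not exist:

    // Illustrative sketch of load-time binding; not the actual Cosmos loader.
    using System;
    using System.Collections.Generic;

    class DynamicLoader
    {
        // Fully qualified method name -> address of verified compiled code.
        readonly Dictionary<string, IntPtr> verifiedMethods =
            new Dictionary<string, IntPtr>();

        // Called once per call site when a library is loaded. The returned
        // address is patched into the call site; user code never sees it.
        public IntPtr Resolve(string methodName)
        {
            IntPtr address;
            if (!verifiedMethods.TryGetValue(methodName, out address))
                throw new TypeLoadException("Unknown method: " + methodName);
            return address;
        }
    }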
The process which has pages swapped out would need to run in a special mode
which verified each code or data pointer (to see whether that data was swapped
out) before it was accessed. This could be a VM interpreter, or it could be an
entirely different set of compiled code; in either case it creates more memory
pressure because of the different versions of code that need to exist. Is this
more or less expensive than an MMU/TLB? That's hard to say; I'd guess it to be
more expensive. If processes (or smaller-granularity 'units') are simply
frozen and not allowed to run while swapped, that would eliminate the need for
dual code paths, but it would also reduce the amount of reclamation swapping
could provide if units are still handling small, frequent tasks.
No, with a software memory manager it will work this way: the scheduler or UI
service sees that all threads of an application module are idle, and if the
machine is under memory pressure it swaps the process to disk. When a thread
needs to wake up, it resumes the process. Please note these processes are not
the large UNIX/Windows processes but small single-threaded object spaces that
use IPC to talk to other processes, rather like an OO service. Because the IPC
cost is almost equivalent to an internal call, you can modularize apps and
increase reuse; i.e. breaking an app into multiple domains/processes is cheap
and carries no penalty. I also intend to provide these swap routines to the
application developer, as he knows which windows are inactive and can swap
them (browsers, anyone?). Personally, I think swapping is just used because of
bad programming; not a single small device supports it, and smartphones and
PDAs run happily with 64 MB. Supporting these devices in the short term is
probably more important than PCs.
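A rough sketch of that swap policy; the interfaces are invented for
illustration and are not the actual Cosmos scheduler:

    // Illustrative sketch of the swap policy; not the actual Cosmos scheduler.
    using System.Collections.Generic;
    using System.Linq;

    enum ModuleThreadState { Running, Idle }

    interface IProcessModule
    {
        IReadOnlyList<ModuleThreadState> Threads { get; }
        bool IsSwappedOut { get; }
        void SwapToDisk();      // Freeze the whole object space to disk.
        void ResumeFromDisk();  // Bring it back before a thread runs.
    }

    static class SwapPolicy
    {
        // Under memory pressure, swap out any module whose threads are all idle.
        public static void Reclaim(IEnumerable<IProcessModule> modules,
                                   bool underMemoryPressure)
        {
            if (!underMemoryPressure) return;
            foreach (var m in modules)
                if (m.Threads.All(t => t == ModuleThreadState.Idle))
                    m.SwapToDisk();
        }

        // A waking thread resumes its module first; no dual code paths needed.
        public static void OnThreadWake(IProcessModule m)
        {
            if (m.IsSwappedOut) m.ResumeFromDisk();
        }
    }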
Software paging does have the nice property that it smartly "makes the common
case fast" (i.e. a system which isn't swapping has no overhead; only when we
incur the HUGE overhead of secondary-storage swapping would we incur the
penalty of virtual memory checks).
Yes, exactly. Also note there is an additional technique here: when there is
no memory pressure, you can run a garbage collector that is fast, spends
little time compacting, and runs infrequently; if the memory pressure changes,
you can swap GCs, and the new GC collects more frequently and can use
expensive best-fit methods.
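Roughly, the collector swap could look like this; the interface and class
names are hypothetical:

    // Illustrative sketch: the memory manager swaps collection strategies as
    // memory pressure changes. Interface and class names are hypothetical.
    interface ICollector
    {
        void Collect();   // One collection pass over the heap.
    }

    class FastInfrequentCollector : ICollector
    {
        // Cheap marking, little compaction; scheduled rarely.
        public void Collect() { /* ... */ }
    }

    class BestFitCompactingCollector : ICollector
    {
        // Expensive best-fit compaction; scheduled frequently.
        public void Collect() { /* ... */ }
    }

    class CollectorPolicy
    {
        ICollector current = new FastInfrequentCollector();

        public void OnMemoryPressureChanged(bool highPressure)
        {
            current = highPressure
                ? (ICollector)new BestFitCompactingCollector()
                : new FastInfrequentCollector();
        }

        public void RunCollection() { current.Collect(); }
    }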
2. Application API security. My thoughts in this space have been about
eliminating most (or all) requests to the user to approve heightened security.
Users simply click "yes" on most of these dialogs, so I think a security
system will be more effective if they never or seldom happen (the opposite of
Vista UAC). Further, I think the primary compromise to prevent is malicious
code uploading unauthorized user data to a third-party server.
One idea I've had is to create a set of interlocking isolation rules. For
example, consider two assemblies: (A) can access the network, and only its
own private local storage; (B) can access all local data but not the network
or other processes. Assemblies of these two types can be intermixed on the
system without fear that unauthorized user data can be uploaded to the
Internet. By adding a secure open/save panel (as in Silverlight or a browser)
to authorize an application to access a specific piece of data, "A" becomes
similar to website security, since different websites can't access each
other's backend servers. A surprising number of applications fit into these
two categories.
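One possible way to express those interlocking rules, sketched with
hypothetical assembly-level attributes that a loader could check (none of
this is an existing API):

    // Hypothetical assembly-level permissions; the loader rejects any
    // assembly that claims both, so no single assembly can both read user
    // data and reach the network.
    using System;

    [AttributeUsage(AttributeTargets.Assembly)]
    class NetworkOnlyAttribute : Attribute { }    // Type A: network + own private storage.

    [AttributeUsage(AttributeTargets.Assembly)]
    class LocalDataOnlyAttribute : Attribute { }  // Type B: local data, no network or IPC.

    static class IsolationCheck
    {
        public static bool IsAllowed(System.Reflection.Assembly a)
        {
            bool network = a.IsDefined(typeof(NetworkOnlyAttribute), false);
            bool local   = a.IsDefined(typeof(LocalDataOnlyAttribute), false);
            return !(network && local);   // The two capabilities never mix.
        }
    }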
Chad mentioned "see rings in our docs". I couldn't find this reference on the
Cosmos website. What is this a reference to?
The rings are the default Cosmos OS; I don't think this will be good enough.
The Cosmos B I'm proposing will be a full POLA/POLP capability-object system,
i.e. there are no ACLs and only limited ambient authority. References to
objects (with no public members) will serve as keys; these keys will be
persisted and can be passed to another process that the process trusts.
Without the key, you cannot create or forge the object; keys can work across
machines.
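A minimal sketch of a reference-as-key (illustrative only): the key is an
object with no public members, so type safety makes it unforgeable, and
possession of the reference is the whole authority check:

    // Sketch of a capability key: an object with no public members. Holding
    // the reference is the authority; type safety makes it unforgeable.
    using System;

    sealed class CapabilityKey
    {
        internal CapabilityKey() { }   // Only trusted code can mint keys.
    }

    class FileService
    {
        readonly CapabilityKey key;   // Handed out to trusted processes.

        public FileService(CapabilityKey key) { this.key = key; }

        public void Open(CapabilityKey presented)
        {
            // Reference identity is the whole check: no ACL lookup at all.
            if (!ReferenceEquals(presented, key))
                throw new UnauthorizedAccessException();
            // ... perform the operation ...
        }
    }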
3. Potential performance improvements. Ben mentioned "if a loaded webserver
spends 50% of time in kernel". This sounds to me like a static webserver.
It's no surprise that a solution which removes the OS abstraction-layer cost
can optimize a dumb "data in, data out" workload like this. Many tiny systems
(including the previously mentioned MIT Exokernel) have done this.
Managed OSes have a LOT in common with exokernels, especially Cosmos B;
however, they remove the exokernels' main weakness. Most web servers spend a
lot of time in the OS cache, NIC, network stack, file system, or disk driver.
We aren't using them in industry because this optimization isn't important.
Dynamic applications represent a much larger percentage of system development
costs, and they spend most of their time in userland. These applications are
not dominated by syscall or context-switch performance. I argue that human
resource (programmer) costs are the dominating factor in most software
applications. Reducing these costs is more important than a fractional
improvement in performance. This is why the trend is towards programming
systems which are inefficient for the machine and efficient for the programmer
(like Ruby and Python at the extreme, or Java and the CLR with garbage
collection at the less extreme end).
No argument on human resources; however, Cosmos supports .NET, so the
development costs will be equivalent (removing the exokernels' weakness).
Obviously we have an issue with admin costs for a server, but the reliability
will be the major argument; no one is happy when servers have issues. Also
remember the security, military, and handheld device markets.
Provided an OS does parallelism well (which Windows and Linux don't, but
Singularity and Cosmos can; the 32+ hardware threads we will soon see in
dual-socket servers will be a major issue), I agree with you. However, note
your earlier comments on spending time optimizing (batching) kernel calls,
probably user-level caching, etc. Why did you do this? If a server like IIS
is 30% more efficient, you're probably looking at hardware savings in the
billions; sure, this doesn't matter for an individual site, but it does
matter for third-party software such as SQL servers, web servers, etc. An OS
is a one-off development investment: if a .NET cloud runs an OS that is 10%
faster and runs all .NET apps, you have customers pretty quickly, as it's a
big hardware cost for some.
Overall, I'm skeptical of claims that this new architecture will have
similar capabilities to old architectures but with better performance. There
are tradeoffs, and while we can point at the inefficiencies of existing
systems, we can't yet point out all the tradeoffs and inefficiencies of a
full real-world system built in a Singularity-like model.
Note our major trade-off: languages must be safe (no pointers) and strongly
typed, and applications must be installed before they run. What you are doing
is pushing the issues earlier in the development cycle; we do a lot of the
heavy lifting at compile time and impose more restrictions on the developer
(no pointers, and strong typing).
There are rumours that some of MSDN runs on Singularity, and MS is building
Midori, hiring some major players like Shap. I don't buy the trade-off
argument; sure, there are trade-offs, but sometimes you pay a small price for
a big gain, e.g. caching provides a big performance gain but costs a small
amount of policy.
Pervasive garbage collection is a significant one of these costs.
The GC costs will be about the same as for a .NET app running on Windows.
Similar to the quoted figures about how TLB misses are becoming more
expensive as CPU speeds increase, garbage collection is becoming more
expensive as CPU speeds and memory sizes increase. In other words, it's very
possible Singularity would have a similar overhead to produce the same
functionality, simply in different places.
I don't think this holds true. The alternative to a GC is manual memory
management with malloc and free, which gets into the human resource costs,
stray pointers, etc. If the situation gets too bad, we can do
reference-counting GCs, but at the moment the CPU cost of doing this is not
justified (except for real-time code). Lastly, the best way to speed things
up is to use less memory and increase L2 cache use; the old 70s designs are
having trouble with parallelism and with the fact that memory access on these
systems is hitting 100 cycles. Cosmos/Singularity apps will be smaller and
have less code (since all devs will be forced to use the .NET libs), and we
can optimize things to suit the changes in HW architecture (e.g. asynchronous
is probably better than synchronous now for IPC; Minix was years ahead of its
time, but the Ring 3/0 model is its limitation).
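As a sketch of what asynchronous IPC means here (illustrative, built on
standard .NET collections rather than the Cosmos IPC system): the sender
enqueues a message and keeps running on its own hardware thread instead of
blocking for a reply.

    // Illustrative async IPC channel using standard .NET collections; the
    // real system would use shared memory, but the shape is the same.
    using System.Collections.Concurrent;

    class AsyncChannel<T>
    {
        readonly BlockingCollection<T> queue = new BlockingCollection<T>();

        // Sender enqueues and keeps running; no synchronous round trip.
        public void Send(T message) { queue.Add(message); }

        // Receiver drains messages when it is scheduled on its own core.
        public T Receive() { return queue.Take(); }
    }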
Perhaps the winning argument for an architecture like Singularity/Cosmos is
similar to the aspect-oriented-programming argument. Currently we're spending
programmer time optimizing these boundaries, whereas if we make the boundaries
very cheap, we don't have to. This will make our code smaller, easier to
write, easier to maintain, and, as a nice side effect, faster in the common
cases. The code required to handle the uncommon cases might still exist, but
if it can exist only in centralized cross-cutting places, then the overall
code can be simpler.
Yes, exactly. Please note we have only just touched on performance and
haven't discussed many of the topics; you can see a modern OS based on this
has much promise, with the biggest gains coming in parallelism, reliability,
security, and maintainability (of the OS). Consider a TDD kernel, also.
4. Passing data between domains. From the Singularity paper, it seems they
conceived a way to make a code constraint that enforces that a given pointer
handoff is the only way to access a data subtree.
This is the nature of strongly typed systems (which IL is): if you don't have
a reference, you can't create or forge the pointer. This is well discussed in
capability systems.
This allows them to safely hand the pointer to another domain without worry
that it can be modified by the original process. They still have to track the
lifetime of this subtree with some kind of foreign GC handle. I don't
understand how they would implement a zero-copy buffer cache using this
system, as once they hand off the pointer, they can't retain a reference to
the buffer anymore.
Once they set it to null it's gone, but the receiver can send the reference
back; hence the reference is a key. The GC is the main issue.
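A sketch of that handoff discipline; in Singularity this is enforced
statically by the compiler/verifier, so the runtime shape below is only to
illustrate the invariant:

    // Sketch of the handoff invariant: sending nulls the local reference, so
    // exactly one domain holds the key at any time.
    class HandoffChannel<T> where T : class
    {
        public void Send(ref T buffer)
        {
            T message = buffer;
            buffer = null;       // Sender gives up its only reference.
            Deliver(message);    // Receiver may later send it back the same way.
        }

        void Deliver(T message) { /* enqueue for the receiving domain */ }
    }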
I like Ben's suggestion to encourage 'immutable data'. One challenge lies in
handling both heap-allocated buffers and structures which are designed to be
marked immutable (though they start out mutable). The buffer would allow
writes (or DMA) until it was marked immutable.
Yep, but you're thinking old style :). Use the OO type-safe language: the
creator can be the only one with access (as for the HW, you can't stop DMA,
but note I/O can only be instantiated via trusted code; IL has no in/out
instructions) by using internal/private methods for changes; when finished,
just tell the creator, or make the Finish method public.
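For example, a sketch of the pattern with a hypothetical class: only code in
the creator's assembly can write (internal), and a public Finish seals the
buffer before it is shared.

    // Hypothetical sketch: internal writes for the creator's assembly only,
    // and a public Finish() that seals the buffer before sharing.
    using System;

    class SealableBuffer
    {
        readonly byte[] data;
        bool finished;

        public SealableBuffer(int size) { data = new byte[size]; }

        internal void Write(int offset, byte value)
        {
            if (finished)
                throw new InvalidOperationException("Buffer is sealed.");
            data[offset] = value;
        }

        public void Finish() { finished = true; }   // Immutable from here on.

        public byte Read(int offset) { return data[offset]; }
    }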
The structures, even while mutable, would only allow pointers to immutable
data. This constraint assures the structure can be safely made immutable at
any time, and that the entire subtree is immutable. This pattern is an
"incremental construction of subtrees of immutable data", which can be handed
to a cross-domain call accepting only an immutable pointer. A buffer manager
could keep immutable blocks around and hand them out to different callers as
necessary. Cross-GC references would manage lifetimes for the data.
Structures could cheaply aggregate the total memory size of the subtree for
accounting the size of a cross-heap reference.
Note that .NET really helps here, as all the value types plus string are
immutable.
It seems one challenge of this system is cheaply tracking the cross-GC-domain
references. Perhaps all immutable pointers would need to be boxed into
cross-GC handles.
Note, however, that you can get the compiler to do the work for you by
marking shared-memory objects with an attribute at compile time; whenever a
local or stack reference is assigned to a shared structure, you can be
informed of it. You could then use a separate GC for the shared memory which
walks the stack for such references. There are many solutions here. This is
what I have now:
There are 3 levels of shared memory:
1. A process can directly access a sub-process's memory (and could pass a
pointer to itself); this is completely controlled by the process. Note it is
only possible within process trees, as anything else requires very careful
management of the memory.
2. Shared memory managed through the IPC Shared Memory System. All such
classes must be marked with SharedMemory attributes, which forces the
compiler to use the shared memory allocator. To use this shared memory, a
pointer is simply passed to a process. Locks are placed, and all access
should go through the locks. If a process terminates, the state of the memory
should be fine provided it did not hold a lock. This method is used by the
IPC system (e.g. messages are placed in a level 2 shared memory space, and
the lock is normally held by the sender). It relies on both parties doing the
right thing with the lock. It is fast and should only be used when sender and
receiver are controlled (e.g. the IPC system controls both sender and
receiver). A control block exposes an EventWaitHandle.
3. As level 2, but more carefully managed: the pointers are indirect and will
deny access if the user is not the current owner, i.e. only one user has
visibility of the shared memory at a time. The pointer contains a reference
to a control block which exposes an EventWaitHandle. This is the recommended
method; it is secure and reliable. A sketch follows below.
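A sketch of the level 3 handle (names hypothetical): the indirect pointer
resolves only for the current owner, and the control block exposes the wait
handle.

    // Sketch of a level 3 handle: the indirect pointer resolves only for the
    // current owner; a control block exposes the EventWaitHandle.
    using System;
    using System.Threading;

    class SharedMemoryHandle<T> where T : class
    {
        T block;          // The shared object itself; never handed out raw.
        object owner;     // The single party allowed to touch it right now.

        public readonly EventWaitHandle Ready = new AutoResetEvent(false);

        public T Access(object caller)
        {
            if (!ReferenceEquals(caller, owner))
                throw new UnauthorizedAccessException();   // Deny non-owners.
            return block;
        }

        public void TransferTo(object newOwner)
        {
            owner = newOwner;   // Only one user has visibility at a time.
            Ready.Set();        // Wake the new owner.
        }
    }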
5. Real-time garbage collection. Ben made reference to "real time or counting
collectors for device drivers". Counting collectors introduce the possibility
of memory leaks through cycles, unless they have a sweeping check, which
eliminates their real-time nature. Real-time sweeping/copying collectors are
a largely unsolved problem. The typical "success stories" I've seen involve
hard bounds on heap size to limit pause times. When heap sizes are not
bounded, they are not real-time, or hardware support is required. Do you have
a reference to something you feel is promising?
No issue with bounding heap size. In fact, all apps will have a
metadata-specified memory and CPU time limit, which may or may not be used.
Note each driver has its own heap.
IMO, the most promising work in real-time collectors is the triad of MS
Research collectors I mentioned in my earlier post (Chicken, Stopless, and
Clover). These are soft real-time collectors that produce acceptable delays
in most real-world situations. However, they either require the overhead of
introducing a write barrier in all code, or two versions of the code, one
with a write barrier and one without.
Again, this speaks to the performance overhead argument I made above. It may
be that we're happier spending ~35% of our CPU on write barriers and garbage
collection scans, rather than on MMU/TLB costs, because our code can be
simpler. However, it's hard to argue that a radical redesign like this will
have superior performance before the issues are worked out.
I will avoid write barriers like the plague, along with anything else that
interferes with the execution stream. Device drivers will use little memory
(most of it will even be shared memory); if in the worst case they stop for
1 ms, it just means it's not a good real-time driver (as long as the
real-time scheduler is aware of this, it can still meet its commitments).
Most real-time systems are dog slow; they just work within the time frame
advertised (hence, if you limit the heap size, you know the time). I probably
misused the real-time term as well; I'm more interested in short
interruptions for device drivers than in doing things in deterministic times,
though a real-time Cosmos is intriguing. Also note device drivers do no I/O;
they communicate with the system-specific HAL, which does it all.
As the GC marking can be done on idle threads (marking is the most
time-consuming part) and doesn't need to stop the world, I don't think it
will be a big deal; the actual sweep does need a stop-the-world, and I'm not
smart enough to solve that problem. I suspect the actual cost of GC to the
running process will be about 1% (though it will probably use about 4% of
resources on idle threads). However, I think the nursery heap allocation,
being 3 x86 instructions, will be much faster than a traditional C heap or
the .NET nursery.
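The nursery fast path is essentially bump-pointer allocation; a sketch
(illustrative, with byte offsets standing in for real pointers):

    // Sketch of bump-pointer nursery allocation; the fast path is roughly
    // three x86 instructions (add, cmp, conditional jump).
    static class Nursery
    {
        static int next;    // Offset of the next free byte.
        static int limit;   // End of the nursery region.

        static int Allocate(int size)
        {
            int obj = next;
            next += size;                       // add
            if (next > limit)                   // cmp
                return CollectAndRetry(size);   // ja: slow path, run the GC.
            return obj;                         // No locks, no free lists.
        }

        static int CollectAndRetry(int size) { /* collect, then retry */ return 0; }
    }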
I'm sort of toying with the idea of having a system-wide shared nursery for
allocations, as a large nursery size can give you a big benefit. When a
collection is needed, you go down the most-active-process list calling their
collectors to update their heaps; if not enough is cleaned, you go on to the
next one (a collector knows a reference belongs to it if the reference is in
its allocated memory ranges), and so on. This needs a lot more thought before
being taken further.