-
A Hybrid Execution Environment for Computer-Interpretable Guidelines in PROforma
Authors:
Alexandra Kogan,
Roy Leizer,
Szymon Wilk,
David Glasspool
Abstract:
In this paper, we share our experience of developing a hybrid execution environment for computer-interpretable guidelines (CIGs) in PROforma. The proposed environment is part of the CAPABLE system which provides coaching for cancer patients and decision support for physicians. It extends a standard PROforma execution engine - Deontics Engine (DE) - with additional components that act as wrappers a…
▽ More
In this paper, we share our experience of developing a hybrid execution environment for computer-interpretable guidelines (CIGs) in PROforma. The proposed environment is part of the CAPABLE system which provides coaching for cancer patients and decision support for physicians. It extends a standard PROforma execution engine - Deontics Engine (DE) - with additional components that act as wrappers around DE, allow handling of non-standard tasks, and facilitate integration with the rest of the CAPABLE system. This yields a hybrid environment in which the standard engine and specialized components must be interfaced together by some intervening layer. In the CAPABLE system this has been achieved by defining a set of specialized meta-properties which are attached to data and tasks in the PROforma CIGs to specify the interface between engine and components.
△ Less
Submitted 7 April, 2024;
originally announced April 2024.
-
Spot the bot: Coarse-Grained Partition of Semantic Paths for Bots and Humans
Authors:
Vasilii A. Gromov,
Alexandra S. Kogan
Abstract:
Nowadays, technology is rapidly advancing: bots are writing comments, articles, and reviews. Due to this fact, it is crucial to know if the text was written by a human or by a bot. This paper focuses on comparing structures of the coarse-grained partitions of semantic paths for human-written and bot-generated texts. We compare the clusterizations of datasets of n-grams from literary texts and text…
▽ More
Nowadays, technology is rapidly advancing: bots are writing comments, articles, and reviews. Due to this fact, it is crucial to know if the text was written by a human or by a bot. This paper focuses on comparing structures of the coarse-grained partitions of semantic paths for human-written and bot-generated texts. We compare the clusterizations of datasets of n-grams from literary texts and texts generated by several bots. The hypothesis is that the structures and clusterizations are different. Our research supports the hypothesis. As the semantic structure may be different for different languages, we investigate Russian, English, German, and Vietnamese languages.
△ Less
Submitted 27 February, 2024;
originally announced February 2024.
-
Improving Inference Performance of Machine Learning with the Divide-and-Conquer Principle
Authors:
Alex Kogan
Abstract:
Many popular machine learning models scale poorly when deployed on CPUs. In this paper we explore the reasons why and propose a simple, yet effective approach based on the well-known Divide-and-Conquer Principle to tackle this problem of great practical importance. Given an inference job, instead of using all available computing resources (i.e., CPU cores) for running it, the idea is to break the…
▽ More
Many popular machine learning models scale poorly when deployed on CPUs. In this paper we explore the reasons why and propose a simple, yet effective approach based on the well-known Divide-and-Conquer Principle to tackle this problem of great practical importance. Given an inference job, instead of using all available computing resources (i.e., CPU cores) for running it, the idea is to break the job into independent parts that can be executed in parallel, each with the number of cores according to its expected computational cost. We implement this idea in the popular OnnxRuntime framework and evaluate its effectiveness with several use cases, including the well-known models for optical character recognition (PaddleOCR) and natural language processing (BERT).
△ Less
Submitted 1 March, 2023; v1 submitted 12 January, 2023;
originally announced January 2023.
-
Intra-process Caching and Reuse of Threads
Authors:
Dave Dice,
Alex Kogan
Abstract:
Creating and destroying threads on modern Linux systems incurs high latency, absent concurrency, and fails to scale as we increase concurrency. To address this concern we introduce a process-local cache of idle threads. Specifically, instead of destroying a thread when it terminates, we cache and then recycle that thread in the context of subsequent thread creation requests. This approach shows si…
▽ More
Creating and destroying threads on modern Linux systems incurs high latency, absent concurrency, and fails to scale as we increase concurrency. To address this concern we introduce a process-local cache of idle threads. Specifically, instead of destroying a thread when it terminates, we cache and then recycle that thread in the context of subsequent thread creation requests. This approach shows significant promise in various applications and benchmarks that create and destroy threads rapidly and illustrates the need for and potential benefits of improved concurrency infrastructure. With caching, the cost of creating a new thread drops by almost an order of magnitude. As our experiments demonstrate, this results in significant performance improvements for multiple applications that aggressively create and destroy numerous threads.
△ Less
Submitted 16 May, 2021;
originally announced May 2021.
-
Ready When You Are: Efficient Condition Variables via Delegated Condition Evaluation
Authors:
Dave Dice,
Alex Kogan
Abstract:
Multi-thread applications commonly utilize condition variables for communication between threads. Condition variables allow threads to block and wait until a certain condition holds, and also enable threads to wake up their blocked peers notifying them about a change to the state of shared data. Quite often such notifications are delivered to all threads, while only a small number of specific thre…
▽ More
Multi-thread applications commonly utilize condition variables for communication between threads. Condition variables allow threads to block and wait until a certain condition holds, and also enable threads to wake up their blocked peers notifying them about a change to the state of shared data. Quite often such notifications are delivered to all threads, while only a small number of specific threads is interested in it. This results in so-called futile wakeups, where threads receiving the notification wake up and resume their execution only to realize that the condition they are waiting for does not hold and they need to wait again. Those wakeups cause numerous context switches, increase lock contention and cache pressure, translating into lots of wasted computing cycles and energy.
In this work, we propose to delegate conditions on which threads are waiting to the thread sending notifications. This enables the latter to evaluate the conditions and send the notification(s) only to the relevant thread(s), practically eliminating futile wakeups altogether. Our initial evaluation of this idea shows promising results, achieving 3-4x throughput improvement over legacy condition variables.
△ Less
Submitted 14 May, 2021;
originally announced May 2021.
-
Optimizing Inference Performance of Transformers on CPUs
Authors:
Dave Dice,
Alex Kogan
Abstract:
The Transformer architecture revolutionized the field of natural language processing (NLP). Transformers-based models (e.g., BERT) power many important Web services, such as search, translation, question-answering, etc. While enormous research attention is paid to the training of those models, relatively little efforts are made to improve their inference performance. This paper comes to address th…
▽ More
The Transformer architecture revolutionized the field of natural language processing (NLP). Transformers-based models (e.g., BERT) power many important Web services, such as search, translation, question-answering, etc. While enormous research attention is paid to the training of those models, relatively little efforts are made to improve their inference performance. This paper comes to address this gap by presenting an empirical analysis of scalability and performance of inferencing a Transformer-based model on CPUs. Focusing on the highly popular BERT model, we identify key components of the Transformer architecture where the bulk of the computation happens, and propose three optimizations to speed them up. The optimizations are evaluated using the inference benchmark from HuggingFace, and are shown to achieve the speedup of up to x2.37. The considered optimizations do not require any changes to the implementation of the models nor affect their accuracy.
△ Less
Submitted 22 February, 2021; v1 submitted 12 February, 2021;
originally announced February 2021.
-
Compact Java Monitors
Authors:
Dave Dice,
Alex Kogan
Abstract:
For scope and context, the idea we'll describe below, Compact Java Monitors, is intended as a potential replacement implementation for the "synchronized" construct in the HotSpot JVM. The readers is assumed to be familiar with current HotSpot implementation.
For scope and context, the idea we'll describe below, Compact Java Monitors, is intended as a potential replacement implementation for the "synchronized" construct in the HotSpot JVM. The readers is assumed to be familiar with current HotSpot implementation.
△ Less
Submitted 26 September, 2021; v1 submitted 28 January, 2021;
originally announced February 2021.
-
Hemlock : Compact and Scalable Mutual Exclusion
Authors:
Dave Dice,
Alex Kogan
Abstract:
We present Hemlock, a novel mutual exclusion locking algorithm that is extremely compact, requiring just one word per thread plus one word per lock, but which still provides local spinning in most circumstances, high throughput under contention, and low latency in the uncontended case. Hemlock is context-free -- not requiring any information to be passed from a lock operation to the corresponding…
▽ More
We present Hemlock, a novel mutual exclusion locking algorithm that is extremely compact, requiring just one word per thread plus one word per lock, but which still provides local spinning in most circumstances, high throughput under contention, and low latency in the uncontended case. Hemlock is context-free -- not requiring any information to be passed from a lock operation to the corresponding unlock -- and FIFO. The performance of Hemlock is competitive with and often better than the best scalable spin locks.
△ Less
Submitted 5 January, 2022; v1 submitted 7 February, 2021;
originally announced February 2021.
-
Efficient Multi-word Compare and Swap
Authors:
Rachid Guerraoui,
Alex Kogan,
Virendra J. Marathe,
Igor Zablotchi
Abstract:
Atomic lock-free multi-word compare-and-swap (MCAS) is a powerful tool for designing concurrent algorithms. Yet, its widespread usage has been limited because lock-free implementations of MCAS make heavy use of expensive compare-and-swap (CAS) instructions. Existing MCAS implementations indeed use at least 2k+1 CASes per k-CAS. This leads to the natural desire to minimize the number of CASes requi…
▽ More
Atomic lock-free multi-word compare-and-swap (MCAS) is a powerful tool for designing concurrent algorithms. Yet, its widespread usage has been limited because lock-free implementations of MCAS make heavy use of expensive compare-and-swap (CAS) instructions. Existing MCAS implementations indeed use at least 2k+1 CASes per k-CAS. This leads to the natural desire to minimize the number of CASes required to implement MCAS. We first prove in this paper that it is impossible to "pack" the information required to perform a k-word CAS (k-CAS) in less than k locations to be CASed. Then we present the first algorithm that requires k+1 CASes per call to k-CAS in the common uncontended case. We implement our algorithm and show that it outperforms a state-of-the-art baseline in a variety of benchmarks in most considered workloads. We also present a durably linearizable (persistent memory friendly) version of our MCAS algorithm using only 2 persistence fences per call, while still only requiring k+1 CASes per k-CAS.
△ Less
Submitted 6 August, 2020;
originally announced August 2020.
-
Scalable Range Locks for Scalable Address Spaces and Beyond
Authors:
Alex Kogan,
Dave Dice,
Shady Issa
Abstract:
Range locks are a synchronization construct designed to provide concurrent access to multiple threads (or processes) to disjoint parts of a shared resource. Originally conceived in the file system context, range locks are gaining increasing interest in the Linux kernel community seeking to alleviate bottlenecks in the virtual memory management subsystem. The existing implementation of range locks…
▽ More
Range locks are a synchronization construct designed to provide concurrent access to multiple threads (or processes) to disjoint parts of a shared resource. Originally conceived in the file system context, range locks are gaining increasing interest in the Linux kernel community seeking to alleviate bottlenecks in the virtual memory management subsystem. The existing implementation of range locks in the kernel, however, uses an internal spin lock to protect the underlying tree structure that keeps track of acquired and requested ranges. This spin lock becomes a point of contention on its own when the range lock is frequently acquired. Furthermore, where and exactly how specific (refined) ranges can be locked remains an open question.
In this paper, we make two independent, but related contributions. First, we propose an alternative approach for building range locks based on linked lists. The lists are easy to maintain in a lock-less fashion, and in fact, our range locks do not use any internal locks in the common case. Second, we show how the range of the lock can be refined in the mprotect operation through a speculative mechanism. This refinement, in turn, allows concurrent execution of mprotect operations on non-overlapping memory regions. We implement our new algorithms and demonstrate their effectiveness in user-space and kernel-space, achieving up to 9$\times$ speedup compared to the stock version of the Linux kernel. Beyond the virtual memory management subsystem, we discuss other applications of range locks in parallel software. As a concrete example, we show how range locks can be used to facilitate the design of scalable concurrent data structures, such as skip lists.
△ Less
Submitted 22 June, 2020;
originally announced June 2020.
-
Fissile Locks
Authors:
Dave Dice,
Alex Kogan
Abstract:
Classic test-and-test (TS) mutual exclusion locks are simple, and enjoy high performance and low latency of ownership transfer under light or no contention. However, they do not scale gracefully under high contention and do not provide any admission order guarantees. Such concerns led to the development of scalable queue-based locks, such as a recent Compact NUMA-aware (CNA) lock, a variant of ano…
▽ More
Classic test-and-test (TS) mutual exclusion locks are simple, and enjoy high performance and low latency of ownership transfer under light or no contention. However, they do not scale gracefully under high contention and do not provide any admission order guarantees. Such concerns led to the development of scalable queue-based locks, such as a recent Compact NUMA-aware (CNA) lock, a variant of another popular queue-based MCS lock. CNA scales well under load and provides certain admission guarantees, but has more complicated lock handover operations than TS and incurs higher latencies at low contention. We propose Fissile locks, which capture the most desirable properties of both TS and CNA. A Fissile lock consists of two underlying locks: a TS lock, which serves as a fast path, and a CNA lock, which serves as a slow path. The key feature of Fissile locks is the ability of threads on the fast path to bypass threads enqueued on the slow path, and acquire the lock with less overhead than CNA. Bypass is bounded (by a tunable parameter) to avoid starvation and ensure long-term fairness. The result is a highly scalable NUMA-aware lock with progress guarantees that performs like TS at low contention and like CNA at high contention.
△ Less
Submitted 1 May, 2020; v1 submitted 10 March, 2020;
originally announced March 2020.
-
Non-congruent non-degenerate curves with identical signatures
Authors:
Eric Geiger,
Irina A. Kogan
Abstract:
While the equality of differential signatures (Calabi et al, Int. J. Comput. Vis. 26: 107-135, 1998) is known to be a necessary condition for congruence, it is not sufficient (Musso and Nicolodi, J. Math Imaging Vis. 35: 68-85, 2009). Hickman (J. Math Imaging Vis. 43: 206-213, 2012, Theorem 2) claimed that for non-degenerate planar curves, equality of Euclidean signatures implies congruence. We pr…
▽ More
While the equality of differential signatures (Calabi et al, Int. J. Comput. Vis. 26: 107-135, 1998) is known to be a necessary condition for congruence, it is not sufficient (Musso and Nicolodi, J. Math Imaging Vis. 35: 68-85, 2009). Hickman (J. Math Imaging Vis. 43: 206-213, 2012, Theorem 2) claimed that for non-degenerate planar curves, equality of Euclidean signatures implies congruence. We prove that while Hickman's claim holds for simple, closed curves with simple signatures, it fails for curves with non-simple signatures. In the later case, we associate a directed graph with the signature and show how various paths along the graph give rise to a family of non-congruent, non-degenerate curves with identical signatures. Using this additional structure, we formulate congruence criteria for non-degenerate, closed, simple curves and show how the paths reflect the global and local symmetries of the corresponding curve.
△ Less
Submitted 22 July, 2021; v1 submitted 19 December, 2019;
originally announced December 2019.
-
Avoiding Scalability Collapse by Restricting Concurrency
Authors:
Dave Dice,
Alex Kogan
Abstract:
Saturated locks often degrade the performance of a multithreaded application, leading to a so-called scalability collapse problem. This problem arises when a growing number of threads circulating through a saturated lock causes the overall application performance to fade or even drop abruptly. This problem is particularly (but not solely) acute on oversubscribed systems (systems with more threads…
▽ More
Saturated locks often degrade the performance of a multithreaded application, leading to a so-called scalability collapse problem. This problem arises when a growing number of threads circulating through a saturated lock causes the overall application performance to fade or even drop abruptly. This problem is particularly (but not solely) acute on oversubscribed systems (systems with more threads than available hardware cores). In this paper, we introduce GCR (generic concurrency restriction), a mechanism that aims to avoid the scalability collapse. GCR, designed as a generic, lock-agnostic wrapper, intercepts lock acquisition calls, and decides when threads would be allowed to proceed with the acquisition of the underlying lock. Furthermore, we present GCR-NUMA, a non-uniform memory access (NUMA)-aware extension of GCR, that strives to ensure that threads allowed to acquire the lock are those that run on the same socket. The extensive evaluation that includes more than two dozen locks, three machines and three benchmarks shows that GCR brings substantial speedup (in many cases, up to three orders of magnitude) in case of contention and growing thread counts, while introducing nearly negligible slowdown when the underlying lock is not contended. GCR-NUMA brings even larger performance gains starting at even lighter lock contention.
△ Less
Submitted 11 July, 2019; v1 submitted 26 May, 2019;
originally announced May 2019.
-
Compact NUMA-Aware Locks
Authors:
Dave Dice,
Alex Kogan
Abstract:
Modern multi-socket architectures exhibit non-uniform memory access (NUMA) behavior, where access by a core to data cached locally on a socket is much faster than access to data cached on a remote socket. Prior work offers several efficient NUMA-aware locks that exploit this behavior by keeping the lock ownership on the same socket, thus reducing remote cache misses and inter-socket communication.…
▽ More
Modern multi-socket architectures exhibit non-uniform memory access (NUMA) behavior, where access by a core to data cached locally on a socket is much faster than access to data cached on a remote socket. Prior work offers several efficient NUMA-aware locks that exploit this behavior by keeping the lock ownership on the same socket, thus reducing remote cache misses and inter-socket communication. Virtually all those locks, however, are hierarchical in their nature, thus requiring space proportional to the number of sockets. The increased memory cost renders NUMA-aware locks unsuitable for systems that are conscious to space requirements of their synchronization constructs, with the Linux kernel being the chief example.
In this work, we present a compact NUMA-aware lock that requires only one word of memory, regardless of the number of sockets in the underlying machine. The new lock is a variant of an efficient (NUMA-oblivious) MCS lock, and inherits its performant features, such as local spinning and a single atomic instruction in the acquisition path. Unlike MCS, the new lock organizes waiting threads in two queues, one composed of threads running on the same socket as the current lock holder, and another composed of threads running on a different socket(s).
We integrated the new lock in the Linux kernel's qspinlock, one of the major synchronization constructs in the kernel. Our evaluation using both user-space and kernel benchmarks shows that the new lock has a single-thread performance of MCS, but significantly outperforms the latter under contention, achieving a similar level of performance when compared to other, state-of-the-art NUMA-aware locks that require substantially more space.
△ Less
Submitted 1 March, 2019; v1 submitted 12 October, 2018;
originally announced October 2018.
-
TWA -- Ticket Locks Augmented with a Waiting Array
Authors:
Dave Dice,
Alex Kogan
Abstract:
The classic ticket lock consists of ticket and grant fields. Arriving threads atomically fetch-and-increment ticket and then wait for grant to become equal to the value returned by the fetch-and-increment primitive, at which point the thread holds the lock. The corresponding unlock operation simply increments grant. This simple design has short code paths and fast handover (transfer of ownership)…
▽ More
The classic ticket lock consists of ticket and grant fields. Arriving threads atomically fetch-and-increment ticket and then wait for grant to become equal to the value returned by the fetch-and-increment primitive, at which point the thread holds the lock. The corresponding unlock operation simply increments grant. This simple design has short code paths and fast handover (transfer of ownership) under light contention, but may suffer degraded scalability under high contention when multiple threads busy wait on the grant field -- so-called global spinning. We propose a variation on ticket locks where long-term waiting threads wait on locations in a waiting array instead of busy waiting on the grant field. The single waiting array is shared among all locks. Short-term waiting is accomplished in the usual manner on the grant field. The resulting algorithm, TWA, improves on ticket locks by limiting the number of threads spinning on the grant field at any given time, reducing the number of remote caches requiring invalidation from the store that releases the lock. In turn, this accelerates handover, and since the lock is held throughout the handover operation, scalability improves. Under light or no contention, TWA yields performance comparable to the classic ticket lock, avoiding the complexity and extra accesses incurred by MCS locks in the handover path, but providing performance above or beyond that of MCS at high contention.
△ Less
Submitted 10 July, 2019; v1 submitted 2 October, 2018;
originally announced October 2018.
-
BRAVO -- Biased Locking for Reader-Writer Locks
Authors:
David Dice,
Alex Kogan
Abstract:
Designers of modern reader-writer locks confront a difficult trade-off related to reader scalability. Locks that have a compact memory representation for active readers will typically suffer under high intensity read-dominated workloads when the "reader indicator"' state is updated frequently by a diverse set of threads, causing cache invalidation and coherence traffic. Other designs, such as coho…
▽ More
Designers of modern reader-writer locks confront a difficult trade-off related to reader scalability. Locks that have a compact memory representation for active readers will typically suffer under high intensity read-dominated workloads when the "reader indicator"' state is updated frequently by a diverse set of threads, causing cache invalidation and coherence traffic. Other designs, such as cohort reader-writer locks, use distributed reader indicators, one per NUMA node. This improves reader-reader scalability, but also increases the size of each lock instance. We propose a simple transformation BRAVO, that augments any existing reader-writer lock, adding just two integer fields to the lock instance. Readers make their presence known to writers by hashing their thread's identity with the lock address, forming an index into a visible readers table. Readers attempt to install the lock address into that element in the table, making their existence known to potential writers. All locks and threads in an address space can share the visible readers table. Updates by readers tend to be diffused over the table, resulting in a NUMA-friendly design. Crucially, readers of the same lock tend to write to different locations in the array, reducing coherence traffic. Specifically, BRAVO allows a simple compact lock to be augmented so as to provide scalable concurrent reading but with only a modest increase in footprint.
△ Less
Submitted 10 July, 2019; v1 submitted 2 October, 2018;
originally announced October 2018.
-
Algorithm for computing $μ$-bases of univariate polynomials
Authors:
Hoon Hong,
Zachary Hough,
Irina A. Kogan
Abstract:
We present a new algorithm for computing a $μ$-basis of the syzygy module of $n$ polynomials in one variable over an arbitrary field $\mathbb{K}$. The algorithm is conceptually different from the previously-developed algorithms by Cox, Sederberg, Chen, Zheng, and Wang for $n=3$, and by Song and Goldman for an arbitrary $n$. It involves computing a "partial" reduced row-echelon form of a…
▽ More
We present a new algorithm for computing a $μ$-basis of the syzygy module of $n$ polynomials in one variable over an arbitrary field $\mathbb{K}$. The algorithm is conceptually different from the previously-developed algorithms by Cox, Sederberg, Chen, Zheng, and Wang for $n=3$, and by Song and Goldman for an arbitrary $n$. It involves computing a "partial" reduced row-echelon form of a $ (2d+1)\times n(d+1)$ matrix over $\mathbb{K}$, where $d$ is the maximum degree of the input polynomials. The proof of the algorithm is based on standard linear algebra and is completely self-contained. It includes a proof of the existence of the $μ$-basis and as a consequence provides an alternative proof of the freeness of the syzygy module. The theoretical (worst case asymptotic) computational complexity of the algorithm is $O(d^2n+d^3+n^2)$. We have implemented this algorithm (HHK) and the one developed by Song and Goldman (SG). Experiments on random inputs indicate that SG gets faster than HHK when $d$ gets sufficiently large for a fixed $n$, and that HHK gets faster than SG when $n$ gets sufficiently large for a fixed $d$.
△ Less
Submitted 8 March, 2017; v1 submitted 15 March, 2016;
originally announced March 2016.
-
Invariants of objects and their images under surjective maps
Authors:
Irina A. Kogan,
Peter J. Olver
Abstract:
We examine the relationships between the differential invariants of objects and of their images under a surjective map. We analyze both the case when the underlying transformation group is projectable and hence induces an action on the image, and the case when only a proper subgroup of the entire group acts projectably. In the former case, we establish a constructible isomorphism between the algeb…
▽ More
We examine the relationships between the differential invariants of objects and of their images under a surjective map. We analyze both the case when the underlying transformation group is projectable and hence induces an action on the image, and the case when only a proper subgroup of the entire group acts projectably. In the former case, we establish a constructible isomorphism between the algebra of differential invariants of the images and the algebra of fiber-wise constant (gauge) differential invariants of the objects. In the latter case, we describe residual effects of the full transformation group on the image invariants. Our motivation comes from the problem of reconstruction of an object from multiple-view images, with central and parallel projections of curves from three-dimensional space to the two-dimensional plane serving as our main examples.
△ Less
Submitted 22 September, 2015;
originally announced September 2015.
-
The Influence of Malloc Placement on TSX Hardware Transactional Memory
Authors:
Dave Dice,
Tim Harris,
Alex Kogan,
Yossi Lev
Abstract:
The hardware transactional memory (HTM) implementation in Intel's i7-4770 "Haswell" processor tracks the transactional read-set in the L1 (level-1), L2 (level-2) and L3 (level-3) caches and the write-set in the L1 cache. Displacement or eviction of read-set entries from the cache hierarchy or write-set entries from the L1 results in abort. We show that the placement policies of dynamic storage all…
▽ More
The hardware transactional memory (HTM) implementation in Intel's i7-4770 "Haswell" processor tracks the transactional read-set in the L1 (level-1), L2 (level-2) and L3 (level-3) caches and the write-set in the L1 cache. Displacement or eviction of read-set entries from the cache hierarchy or write-set entries from the L1 results in abort. We show that the placement policies of dynamic storage allocators -- such as those found in common "malloc" implementations -- can influence the L1 conflict miss rate in the L1. Conflict misses -- sometimes called mapping misses -- arise because of less than ideal associativity and represent imbalanced distribution of active memory blocks over the set of available L1 indices. Under transactional execution conflict misses may manifest as aborts, representing wasted or futile effort instead of a simple stall as would occur in normal execution mode.
Furthermore, when HTM is used for transactional lock elision (TLE), persistent aborts arising from conflict misses can force the offending thread through the so-called "slow path". The slow path is undesirable as the thread must acquire the lock and run the critical section in normal execution mode, precluding the concurrent execution of threads in the "fast path" that monitor that same lock and run their critical sections in transactional mode. For a given lock, multiple threads can concurrently use the transactional fast path, but at most one thread can use the non-transactional slow path at any given time. Threads in the slow path preclude safe concurrent fast path execution. Aborts rising from placement policies and L1 index imbalance can thus result in loss of concurrency and reduced aggregate throughput.
△ Less
Submitted 29 April, 2015; v1 submitted 17 April, 2015;
originally announced April 2015.
-
Hardware extensions to make lazy subscription safe
Authors:
Dave Dice,
Timothy L. Harris,
Alex Kogan,
Yossi Lev,
Mark Moir
Abstract:
Transactional Lock Elision (TLE) uses Hardware Transactional Memory (HTM) to execute unmodified critical sections concurrently, even if they are protected by the same lock. To ensure correctness, the transactions used to execute these critical sections "subscribe" to the lock by reading it and checking that it is available. A recent paper proposed using the tempting "lazy subscription" optimi…
▽ More
Transactional Lock Elision (TLE) uses Hardware Transactional Memory (HTM) to execute unmodified critical sections concurrently, even if they are protected by the same lock. To ensure correctness, the transactions used to execute these critical sections "subscribe" to the lock by reading it and checking that it is available. A recent paper proposed using the tempting "lazy subscription" optimization for a similar technique in a different context, namely transactional systems that use a single global lock (SGL) to protect all transactional data. We identify several pitfalls that show that lazy subscription \emph{is not safe} for TLE because unmodified critical sections executing before subscribing to the lock may behave incorrectly in a number of subtle ways. We also show that recently proposed compiler support for modifying transaction code to ensure subscription occurs before any incorrect behavior could manifest is not sufficient to avoid all of the pitfalls we identify. We further argue that extending such compiler support to avoid all pitfalls would add substantial complexity and would usually limit the extent to which subscription can be deferred, undermining the effectiveness of the optimization. Hardware extensions suggested in the recent proposal also do not address all of the pitfalls we identify. In this extended version of our WTTM 2014 paper, we describe hardware extensions that make lazy subscription safe, both for SGL-based transactional systems and for TLE, without the need for special compiler support. We also explain how nontransactional loads can be exploited, if available, to further enhance the effectiveness of lazy subscription.
△ Less
Submitted 23 July, 2014;
originally announced July 2014.
-
Object-Image Correspondence for Algebraic Curves under Projections
Authors:
Joseph M. Burdis,
Irina A. Kogan,
Hoon Hong
Abstract:
We present a novel algorithm for deciding whether a given planar curve is an image of a given spatial curve, obtained by a central or a parallel projection with unknown parameters. The motivation comes from the problem of establishing a correspondence between an object and an image, taken by a camera with unknown position and parameters. A straightforward approach to this problem consists of setti…
▽ More
We present a novel algorithm for deciding whether a given planar curve is an image of a given spatial curve, obtained by a central or a parallel projection with unknown parameters. The motivation comes from the problem of establishing a correspondence between an object and an image, taken by a camera with unknown position and parameters. A straightforward approach to this problem consists of setting up a system of conditions on the projection parameters and then checking whether or not this system has a solution. The computational advantage of the algorithm presented here, in comparison to algorithms based on the straightforward approach, lies in a significant reduction of a number of real parameters that need to be eliminated in order to establish existence or non-existence of a projection that maps a given spatial curve to a given planar curve. Our algorithm is based on projection criteria that reduce the projection problem to a certain modification of the equivalence problem of planar curves under affine and projective transformations. To solve the latter problem we make an algebraic adaptation of signature construction that has been used to solve the equivalence problems for smooth curves. We introduce a notion of a classifying set of rational differential invariants and produce explicit formulas for such invariants for the actions of the projective and the affine groups on the plane.
△ Less
Submitted 10 March, 2019; v1 submitted 14 March, 2013;
originally announced March 2013.
-
Object-image correspondence for curves under projections
Authors:
Joseph M. Burdis,
Irina A. Kogan
Abstract:
We present a novel algorithm for deciding whether a given planar curve is an image of a given spatial curve, obtained by a central or a parallel projection with unknown parameters. A straightforward approach to this problem consists of setting up a system of conditions on the projection parameters and then checking whether or not this system has a solution. The main advantage of the algorithm pres…
▽ More
We present a novel algorithm for deciding whether a given planar curve is an image of a given spatial curve, obtained by a central or a parallel projection with unknown parameters. A straightforward approach to this problem consists of setting up a system of conditions on the projection parameters and then checking whether or not this system has a solution. The main advantage of the algorithm presented here, in comparison to algorithms based on the straightforward approach, lies in a significant reduction of a number of real parameters that need to be eliminated in order to establish existence or non-existence of a projection that maps a given spatial curve to a given planar curve. Our algorithm is based on projection criteria that reduce the projection problem to a certain modification of the equivalence problem of planar curves under affine and projective transformations. The latter problem is then solved by differential signature construction based on Cartan's moving frame method. A similar approach can be used to decide whether a given finite set of ordered points on a plane is an image of a given finite set of ordered points in R^3. The motivation comes from the problem of establishing a correspondence between an object and an image, taken by a camera with unknown position and parameters.
△ Less
Submitted 15 March, 2013; v1 submitted 6 February, 2012;
originally announced February 2012.
-
Object-image correspondence for curves under finite and affine cameras
Authors:
Joseph M. Burdis,
Irina A. Kogan
Abstract:
We provide criteria for deciding whether a given planar curve is an image of a given spatial curve, obtained by a central or a parallel projection with unknown parameters. These criteria reduce the projection problem to a certain modification of the equivalence problem of planar curves under affine and projective transformations. The latter problem can be addressed using Cartan's moving frame meth…
▽ More
We provide criteria for deciding whether a given planar curve is an image of a given spatial curve, obtained by a central or a parallel projection with unknown parameters. These criteria reduce the projection problem to a certain modification of the equivalence problem of planar curves under affine and projective transformations. The latter problem can be addressed using Cartan's moving frame method. This leads to a novel algorithmic solution of the projection problem for curves. The computational advantage of the algorithms presented here, in comparison to algorithms based on a straightforward solution, lies in a significant reduction of a number of real parameters that has to be eliminated in order to establish existence or non-existence of a projection that maps a given spatial curve to a given planar curve. The same approach can be used to decide whether a given finite set of ordered points on a plane is an image of a given finite set of ordered points in R^3. The motivation comes from the problem of establishing a correspondence between an object and an image, taken by a camera with unknown position and parameters.
△ Less
Submitted 28 February, 2011; v1 submitted 2 April, 2010;
originally announced April 2010.
-
Classification of curves in 2D and 3D via affine integral signatures
Authors:
S. Feng,
I. A. Kogan,
H. Krim
Abstract:
We propose a robust classification algorithm for curves in 2D and 3D, under the special and full groups of affine transformations. To each plane or spatial curve we assign a plane signature curve. Curves, equivalent under an affine transformation, have the same signature. The signatures introduced in this paper are based on integral invariants, which behave much better on noisy images than class…
▽ More
We propose a robust classification algorithm for curves in 2D and 3D, under the special and full groups of affine transformations. To each plane or spatial curve we assign a plane signature curve. Curves, equivalent under an affine transformation, have the same signature. The signatures introduced in this paper are based on integral invariants, which behave much better on noisy images than classically known differential invariants. The comparison with other types of invariants is given in the introduction. Though the integral invariants for planar curves were known before, the affine integral invariants for spatial curves are proposed here for the first time. Using the inductive variation of the moving frame method we compute affine invariants in terms of Euclidean invariants. We present two types of signatures, the global signature and the local signature. Both signatures are independent of parameterization (curve sampling). The global signature depends on the choice of the initial point and does not allow us to compare fragments of curves, and is therefore sensitive to occlusions. The local signature, although is slightly more sensitive to noise, is independent of the choice of the initial point and is not sensitive to occlusions in an image. It helps establish local equivalence of curves. The robustness of these invariants and signatures in their application to the problem of classification of noisy spatial curves extracted from a 3D object is analyzed.
△ Less
Submitted 11 June, 2008;
originally announced June 2008.