Newest 'distributed-computing' Questions

0 votes

0 answers

14 views

Reliable protocol that guarantees complete delivery without an in-order promise

The sender is sending N packets to a receiver. I want a protocol or method that guarantees delivery, so that each packet is received at least once. It is OK if some packets are received more than once due to ...

Yufei Zheng

13

asked Jul 11 at 9:36

-2 votes

0 answers

21 views

How to Seamlessly Distribute Application Processing Between Two Windows Computers Based on Resource Availability? [closed]

I'm looking to create a distributed computing setup using two Windows 10 computers, where applications can run on either computer based on resource availability. The goal is for users to interact with ...

Odd Magnus Grinder

1

asked Jul 8 at 17:59

0 votes

0 answers

19 views

Does ZooKeeper preserve order when moving sessions?

The ZooKeeper book says: When a client creates a ZooKeeper handle using a specific language binding, it establishes a session with the service. The client initially connects to any server in ...

codefast

89

asked Jul 4 at 2:03

0 votes

0 answers

28 views

How to securely conduct lottery-like draws with guaranteed randomness without auditing?

Is there an existing algorithm or method to conduct lottery-like draws that ensures secure and truly random results without the need for auditing? Is there any library that does this? I searched the web ...

aguiadouro

91

asked Jun 24 at 12:51

0 votes

2 answers

32 views

Unable to run code on Multiple GPUs in PyTorch - Usage shows only 1 GPU is being utilized

I am training a Transformer Encoder-Decoder based model for Text summarization. The code works without any errors but uses only 1 GPU when checked with nvidia-smi. However, I want to run it on all the ...

Abid Meraj

1

asked Jun 19 at 5:45

0 votes

0 answers

18 views

Out-of-memory problem when using dist.all_gather

I'm writing code for multi-GPU training, and I need to gather embeddings from different GPUs to calculate the loss and then propagate the gradients back to the different GPUs. However, when the program runs ...

Drack Young

1

asked Jun 7 at 20:57

0 votes

0 answers

24 views

I Have Imagination of Futuristic Computing Scenarios. How Can I Get Involved?

I am currently a backend developer specializing in Java-oriented web services, with a bachelor's degree in Computer Science. After working for a few years, I have become deeply interested in diving ...

alex y

1

asked Jun 6 at 14:36

0 votes

0 answers

25 views

Model trained with distributed training produces NaN values during validation

When I trained my Mamba model on 4 GPUs with DistributedDataParallel, I ran the validation code after the first round of training. The validation process on cuda:3 always gave NaN values, and ...

Ezra Corli

3

asked Jun 5 at 3:45

0 votes

0 answers

38 views

Distributed Training using PyTorch

I am using PyTorch's multiprocessing framework to distribute my training across multiple GPUs. I'm doing this over the batch size, so each GPU has its own independent batch for which it calculates the gradient ...

Gummy bears

188

asked Jun 3 at 11:15

1 vote

2 answers

99 views

How to reliably implement fan out write pattern?

I'm trying to RELIABLY implement that pattern. For practical purposes, assume we have something similar to a Twitter clone (in Cassandra and Node.js). So, user A has 500k followers. When user A posts a ...

InglouriousBastard

55

asked May 27 at 13:38

0 votes

0 answers

17 views

How to Deploy Replicaset and Custom Images in AWS via Ray Docker Images?

I'm getting started with Ray on an AWS cluster and trying to understand the declarative YAML config as in the Ray GitHub repository. I can see it is possible to directly add the Ray Docker images on the AWS EC2 ...

Della

1,550

asked May 27 at 3:14

1 vote

1 answer

72 views

Using torchrun with AWS SageMaker estimator on multi-GPU node

I would like to run a training job on an ml.p4d.24xlarge machine on AWS SageMaker. I ran into a similar issue described here with significant slowdowns in training time. I understand now that I should run ...

probably45

23

asked May 24 at 19:25

2 votes

0 answers

14 views

Why am I getting a "LM_WRITE_LOG_FAILED ERROR 80000" in GridDB when writing to the log file?

I'm using GridDB for managing a distributed database system and recently encountered the following error while trying to perform operations: 80000 LM_WRITE_LOG_FAILED ERROR Writing to log file failed. ...

omar esawy

61

asked May 24 at 12:03

2 votes

0 answers

11 views

Why am I getting a "JC_CONTAINER_NOT_OPENED ERROR 145034" in GridDB when performing operations on a container?

I'm working with GridDB to manage a distributed database and recently encountered the following error while performing operations on a container: 145034 JC_CONTAINER_NOT_OPENED ERROR Status check of ...

omar esawy

61

asked May 21 at 9:16

0 votes

0 answers

26 views

How do microservices communicate with each other when they are secured with JWT?

I am currently learning microservices architecture. I learned that you can use JWT, OAuth, and a bunch of other mechanisms to secure microservices, but one thing that confuses me is how they ...

Yasin Ahmed

21

asked May 18 at 20:29
