Fl4m3Ph03n1x

Fl4m3Ph03n1x

Ecto multiple streams in 1 transaction

Background

PS: the following situation describes an hypothetical scenario, where I own a company that sells things to customers.

I have an Ecto query that is so big, that my machine cannot handle it. With billions of results returned, there is probably not enough RAM in the world that can handle it.

The solution here (or so my research indicates) is to use streams. Streams were made for potentially infinite sets of results, which would fit my use case.

https://hexdocs.pm/ecto/Ecto.Repo.html#c:stream/2

Problem

So lets imagine that I want to delete All users that bought a given item. Maybe that item was not really legal in their country, and now me, the poor guy in IT, has to fix things so the world doesn’t come down crashing.

Naive way:

item_id = "123asdasd123"

purchase_ids =
      Purchases
      |> where([p], p.item_id == ^item_id)
      |> select([p], p.id)
      |> Repo.all()

Users
    |> where([u], u.purchase_id in ^purchase_ids)
    |> Repo.delete_all()

This is the naive way. I call it naive, because of 2 issues:

  • We have so many purchases, that the machine’s memory will overflow (looking at purchase_ids query)
  • purchase_ids will likely have more than 100K ids, so the second query (where we delete things) will fail as it hits Postgres parameters limit of 32K: https://stackoverflow.com/a/42251312/1337392

What can I say, our product is highly addictive and very well priced!
Our customers simply cant get enough of it. Don’t know why. Nope. No reason comes to mind. None at all.

With these problems in mind, I cannot help my customers and grow my empire, I mean, little home owned business.

I did find this possible solution:

Stream way:

item_id = "123asdasd123"

purchase_ids =
      Purchases
      |> where([p], p.item_id == ^item_id)
      |> select([p], p.id)

stream = Repo.stream(purchase_ids)

Repo.transacion(fn -> 
  ids = Enum.to_list(stream)

  Users
    |> where([u], u.purchase_id in ^ids)
    |> Repo.delete_all()
end)

Questions

However, I am not convinced this will work:

  • I am using Enum.to_list and saving everything into a variable, placing everything into memory again. So I am not gaining any advantage by using Repo.stream.
  • I still have too many ids for my Repo.delete_all to work without blowing up

I guess the one advantage here is that this now a transaction, so either everything goes or nothing goes.

So, the following questions arise:

  • How do I properly make use of streams in this scenario?
  • Can I delete items by streaming parameters (ids) or do I have to manually batch them?
  • Can I stream ids to Repo.delete_all ?

Marked As Solved

Fl4m3Ph03n1x

Fl4m3Ph03n1x

Every question post created here creates an entry in a dedicated thread in the official forum iirc. Nonetheless, I still post my questions in both places. And when I find an answer, I add it to both places as well.

I do this mainly for visibility, both for the community, and for the question itself, although the later one is less impactful due to the mentioned DevChat thread the official forum has.

Solutions

In regards to the question, there are two possible solutions.

One suggested by benwilson:

query = from u in Users,
  join: p in assoc(u, :purchase),
  where: p.item_id == ^item_id

Repo.delete_all(query)

And the other by Aleksei Matiushkin:

Repo.transacion(fn ->
  max_rows = 500

  purchase_ids
  |> Repo.stream(max_rows: max_rows)
  |> Stream.chunk_every(max_rows)
  |> Stream.each(fn ids ->
     Users
     |> where([u], u.purchase_id in ^ids)
     |> Repo.delete_all()
  end)
  |> Stream.run()
end, timeout: :infinity)

My pick

The first solution is great, but it requires the User Schema to have a belongs_to :purchase, Purchase definition in its schema. Unfortunately for me, this was a deal breaker, since changing any schemas in the project where I am working in is either not allowed or would result in a lengthy approval process.

So I went with the second solution that is self contained. It requires no changes to any schemas and it can work with the data as is.

Popular Backend topics Top

PragmaticBookshelf
The next step in the evolution of user interfaces is here. Chatbots let your users interact with your service in their own natural langua...
New
bot
Kotlin v1.4.20-M1 has been released. Link: Release Kotlin 1.4.20-M1 · JetBrains/kotlin · GitHub
New
First poster: bot
Ruby on Rails v6.1.0 has been released. Link: Release 6.1.0 · rails/rails · GitHub
New
Jsdr3398
I just thought of this. Are there any disadvantages when making your server in Assembly (other than having to learn a bunch of stuff :ro...
New
PragmaticBookshelf
Build highly interactive applications without ever leaving Elixir, the way the experts do. Let LiveView take care of performance, scalabi...
New
AstonJ
More cool tech inspired by <span class="hashtag-icon-placeholder"></span>erlang :nerd_face: By combining the fault-tolerance and massiv...
New
joshi
Hey everybody! I’m working on the project that includes import of Oracle data to PostgreSQL. That data comes as Oracle export (expdp) fi...
New
Cellane
I’ve been asked by my supervisors at work to finally give everyone in the team presentation about “that Elixir thing you can’t seem to sh...
New
JimmyCarterSon
Hello, I am working on a new application with Elixir, Dish_out. I want to see Data I follow this tutorial with Elixir Casts. However, I ...
New
ManningBooks
Spring Security in Action, Second Edition is a revised version of the bestselling original, fully updated for Spring Boot 3 and Oauth2/Op...
New

Other popular topics Top

Devtalk
Hello Devtalk World! Please let us know a little about who you are and where you’re from :nerd_face:
New
AstonJ
What chair do you have while working… and why? Is there a ‘best’ type of chair or working position for developers?
New
AstonJ
We have a thread about the keyboards we have, but what about nice keyboards we come across that we want? If you have seen any that look n...
New
AstonJ
I ended up cancelling my Moonlander order as I think it’s just going to be a bit too bulky for me. I think the Planck and the Preonic (o...
New
PragmaticBookshelf
“Finding the Boundaries” Hero’s Journey with Noel Rappin @noelrappin Even when you’re ultimately right about what the future ho...
New
OvermindDL1
Woooooooo! This is such a huge release for it, and 2 years incoming! In short, the library is now using an updated hyper backend (not j...
New
mafinar
This is going to be a long an frequently posted thread. While talking to a friend of mine who has taken data structure and algorithm cou...
New
PragmaticBookshelf
Author Spotlight Mike Riley @mriley This month, we turn the spotlight on Mike Riley, author of Portable Python Projects. Mike’s book ...
New
AstonJ
Chris Seaton, the creator of TruffleRuby has died. It appears from suicide :cry: He left this note on Twitter on the weekend: And one...
New
New