Fl4m3Ph03n1x

Fl4m3Ph03n1x

How to get the top X results of a given category using Ecto?

Background

I have to queries that return a colossal amount of data on their own. I cannot use Repo.all as doing so would materialize these into memory, which would then quickly run out.

So I am trying to push as much as I can to the pSQL DB, and force the DB to do as much work as possible.

My issue starts with 2 queries.

This ones counts fruits and veggies and aggregates everything into a neat map.

all_counts =
      table_A
      |> join(:left, [item_A], item_B in table_B,
        on:
          item_A.home_id == item_B.home_id and
            item_A.path == item_B.path
      )
      |> select([unfiltered_item, filtered_item], %{
        path: item_A.path,
        item_fruits_count: coalesce(item_A.fruits, 0),
        item_veggies_count: coalesce(item_B.veggies, 0),
        dataset_id: item_A.home_id
      })
      |> subquery()

The second one, joins 2 tables as well (items and photos), with nothing fancy:

file_info =
      table_C
      |> join(:inner, [item], file in table_D,
        on:
          item.id == file.item_id and not file.deleted
      )
      |> select([item, file], %{
        item_id: item.id,
        home_id: item.home_id,
        path: item.path,
        photo_key: file.photo_key
      })
      |> subquery()

Problem

Now the problem is that I need to merge these 2 together.
At first, one would think to do something like this:

result =
      all_counts
      |> join(:inner, [c], f in ^file_info, on: c.home_id == f.home_id and c.path == f.path)
      |> select([c, f], %{
        item_id: f.item_id,
        home_id: f.home_id,
        path: f.path,
        photo_key: f.photo_key,
        # ... you get the idea
      })
      |> Repo.all()

But this creates an issue, namely, the it will return so much data, the machines will run out of memory.

Approach

The approach I am using to solve this problem is to group items by home_id and path (since that is unique for each destination) and then return only a portion of the data I need, lets say, the top 3 items ordered by id.

Source:

Here is where my difficulties begin.
I cannot use pSQL directly, I must use Ecto (for reasons beyond this post).

Normally I would use CTEs or row_number():

With ctes:

 WITH cte AS
  ( SELECT name, value,
           ROW_NUMBER() OVER (PARTITION BY name
                              ORDER BY value DESC
                             )
             AS rn
    FROM t
  )
SELECT name, value, rn
FROM cte
WHERE rn <= 3
ORDER BY name, rn ;

With row_number:

SELECT name, value, rn
FROM 
  ( SELECT name, value,
           ROW_NUMBER() OVER (PARTITION BY name
                              ORDER BY value DESC
                             )
             AS rn
    FROM t
  ) tmp 
WHERE rn <= 3
ORDER BY name, rn ; 

However, I am not familiar enough with Ecto to know how to use them.

With CTEs, I understand I should avoid them, as they serve no purpose in Ecto:
https://hexdocs.pm/ecto/Ecto.Query.html#with_cte/3

With row_number() I would need to partition by both home_id and path (2 fields) instead of one:
https://hexdocs.pm/ecto/Ecto.Query.WindowAPI.html#row_number/0

Question

How do I get the result, to return the top 3 results, grouped by home_id and path and ordered by item_id using Ecto?

Marked As Solved

Fl4m3Ph03n1x

Fl4m3Ph03n1x

Solution(s)

From what I gathered, there are two decent possible solutions to this conundrum.

row_number + over

One of them is using row_number() with over() from Ecto:
https://hexdocs.pm/ecto/Ecto.Query.WindowAPI.html#row_number/0

Assuming I join both file_info with all_counts in a single table, I can then perform the query as mentioned in the previous SO post I mentioned in my question:

file_info_with_counts
|> select([fi], %{
        rn: over(row_number(), partition_by: [fi.home_id, fi.path], order_by: [asc: fi.item_id]),
        item_id: fi.item_id,
       # you get the idea ...
      })
|> subquery()

IO.inspect(file_info_with_counts |> where([c], c.rn <= 3) |> Repo.all()

Which prints what I wanted.

Source:

Lateral inner joins

However, as mentioned by some people in the community, this solution is rather old, and these days lateral joins seem to also cover this use case.

tops = 
  from top in "file_info_with_counts", 
    where: top.home_id == parent_as(:parent).home_id,
    where: top.path == parent_as(:parent).path,
    order_by: [asc: top.item_id],
    limit: 3,
    select: %{id: top.id}

from parent in "file_info_with_counts",
  as: :parent,
  group_by: [parent.home_id, parent.path],
  lateral_join: top in subquery(tops), 
  on: true,
  select: %{home_id: parent.home_id, path: parent.path, item_id: top.id}

Source:

This solution is not without merit, however, given my familiarity with row_number I opted for that solution instead.

Unless there is a considerable performance difference between the two in favor of lateral joins, I will keep the previous solution.

For more info, here is the original source where I got these answers:

Also Liked

andrea

andrea

Will Ecto be able to support NoSQL databases in the future?

Fl4m3Ph03n1x

Fl4m3Ph03n1x

While I am not an expert, I believe there is a package called mongodb_ecto:

So if you use MongoDB, this might be your thing.

Where Next?

Popular Backend topics Top

Rainer
Is there somewhere a good introduction to rust for experienced programmers (with years of C++/C#/Java experience)? Wanted to give it a t...
New
Kurisu
Hello and happy new year! I would like to buy a Ruby On Rails ebook for learning purpose. What would be the ROR equivalent of “Programm...
New
Jsdr3398
I’m trying to create a router where everything is in a collection of routes (similar to how I do my routes in expressjs). But it doesn’t ...
New
andrea
Can Phoenix LiveView be used in multi-page applications, unlike React/Vue/Blazor which seems to be targeted for SPA?
New
Ookma-Kyi
I am learning Elixir at the moment but am finding it difficult to learn. The issue is there aren’t any good tutorial sites out there when...
New
JimmyCarterSon
I am following this tutorial . I have followed carefully correcting errors as I go. The app allows you to create a blog post and add comm...
New
sona11
If isReachable throws an IOException in Java, what is the right step to do and why? The application, I believe, should halt the process ...
New
Fl4m3Ph03n1x
Background I have a module that uses TypedStruct to create structs. This is the code: defmodule Shared.Data.Authorization do @moduledo...
New
Fl4m3Ph03n1x
Background I have a release file inside a tarball. However I want the final release to have some additional files and to move things aro...
New
jaeyson
Hi! I have clarifications (please correct me, as I mostly mix/confuse this details) with the following: The term RAG here where it read...
New

Other popular topics Top

AstonJ
Thanks to @foxtrottwist’s and @Tomas’s posts in this thread: Poll: Which code editor do you use? I bought Onivim! :nerd_face: https://on...
New
AstonJ
Inspired by this post from @Carter, which languages, frameworks or other tech or tools do you think is killing it right now? :upside_down...
New
AstonJ
Do the test and post your score :nerd_face: :keyboard: If possible, please add info such as the keyboard you’re using, the layout (Qw...
New
PragmaticBookshelf
Learn different ways of writing concurrent code in Elixir and increase your application's performance, without sacrificing scalability or...
New
AstonJ
Seems like a lot of people caught it - just wondered whether any of you did? As far as I know I didn’t, but it wouldn’t surprise me if I...
New
mafinar
Crystal recently reached version 1. I had been following it for awhile but never got to really learn it. Most languages I picked up out o...
New
foxtrottwist
A few weeks ago I started using Warp a terminal written in rust. Though in it’s current state of development there are a few caveats (tab...
New
AstonJ
If you want a quick and easy way to block any website on your Mac using Little Snitch simply… File &gt; New Rule: And select Deny, O...
New
PragmaticBookshelf
Develop, deploy, and debug BEAM applications using BEAMOps: a new paradigm that focuses on scalability, fault tolerance, and owning each ...
New
AstonJ
If you’re getting errors like this: psql: error: connection to server on socket “/tmp/.s.PGSQL.5432” failed: No such file or directory ...
New