Fl4m3Ph03n1x

Fl4m3Ph03n1x

How to get the top X results of a given category using Ecto?

Background

I have to queries that return a colossal amount of data on their own. I cannot use Repo.all as doing so would materialize these into memory, which would then quickly run out.

So I am trying to push as much as I can to the pSQL DB, and force the DB to do as much work as possible.

My issue starts with 2 queries.

This ones counts fruits and veggies and aggregates everything into a neat map.

all_counts =
      table_A
      |> join(:left, [item_A], item_B in table_B,
        on:
          item_A.home_id == item_B.home_id and
            item_A.path == item_B.path
      )
      |> select([unfiltered_item, filtered_item], %{
        path: item_A.path,
        item_fruits_count: coalesce(item_A.fruits, 0),
        item_veggies_count: coalesce(item_B.veggies, 0),
        dataset_id: item_A.home_id
      })
      |> subquery()

The second one, joins 2 tables as well (items and photos), with nothing fancy:

file_info =
      table_C
      |> join(:inner, [item], file in table_D,
        on:
          item.id == file.item_id and not file.deleted
      )
      |> select([item, file], %{
        item_id: item.id,
        home_id: item.home_id,
        path: item.path,
        photo_key: file.photo_key
      })
      |> subquery()

Problem

Now the problem is that I need to merge these 2 together.
At first, one would think to do something like this:

result =
      all_counts
      |> join(:inner, [c], f in ^file_info, on: c.home_id == f.home_id and c.path == f.path)
      |> select([c, f], %{
        item_id: f.item_id,
        home_id: f.home_id,
        path: f.path,
        photo_key: f.photo_key,
        # ... you get the idea
      })
      |> Repo.all()

But this creates an issue, namely, the it will return so much data, the machines will run out of memory.

Approach

The approach I am using to solve this problem is to group items by home_id and path (since that is unique for each destination) and then return only a portion of the data I need, lets say, the top 3 items ordered by id.

Source:

Here is where my difficulties begin.
I cannot use pSQL directly, I must use Ecto (for reasons beyond this post).

Normally I would use CTEs or row_number():

With ctes:

 WITH cte AS
  ( SELECT name, value,
           ROW_NUMBER() OVER (PARTITION BY name
                              ORDER BY value DESC
                             )
             AS rn
    FROM t
  )
SELECT name, value, rn
FROM cte
WHERE rn <= 3
ORDER BY name, rn ;

With row_number:

SELECT name, value, rn
FROM 
  ( SELECT name, value,
           ROW_NUMBER() OVER (PARTITION BY name
                              ORDER BY value DESC
                             )
             AS rn
    FROM t
  ) tmp 
WHERE rn <= 3
ORDER BY name, rn ; 

However, I am not familiar enough with Ecto to know how to use them.

With CTEs, I understand I should avoid them, as they serve no purpose in Ecto:
https://hexdocs.pm/ecto/Ecto.Query.html#with_cte/3

With row_number() I would need to partition by both home_id and path (2 fields) instead of one:
https://hexdocs.pm/ecto/Ecto.Query.WindowAPI.html#row_number/0

Question

How do I get the result, to return the top 3 results, grouped by home_id and path and ordered by item_id using Ecto?

Marked As Solved

Fl4m3Ph03n1x

Fl4m3Ph03n1x

Solution(s)

From what I gathered, there are two decent possible solutions to this conundrum.

row_number + over

One of them is using row_number() with over() from Ecto:
https://hexdocs.pm/ecto/Ecto.Query.WindowAPI.html#row_number/0

Assuming I join both file_info with all_counts in a single table, I can then perform the query as mentioned in the previous SO post I mentioned in my question:

file_info_with_counts
|> select([fi], %{
        rn: over(row_number(), partition_by: [fi.home_id, fi.path], order_by: [asc: fi.item_id]),
        item_id: fi.item_id,
       # you get the idea ...
      })
|> subquery()

IO.inspect(file_info_with_counts |> where([c], c.rn <= 3) |> Repo.all()

Which prints what I wanted.

Source:

Lateral inner joins

However, as mentioned by some people in the community, this solution is rather old, and these days lateral joins seem to also cover this use case.

tops = 
  from top in "file_info_with_counts", 
    where: top.home_id == parent_as(:parent).home_id,
    where: top.path == parent_as(:parent).path,
    order_by: [asc: top.item_id],
    limit: 3,
    select: %{id: top.id}

from parent in "file_info_with_counts",
  as: :parent,
  group_by: [parent.home_id, parent.path],
  lateral_join: top in subquery(tops), 
  on: true,
  select: %{home_id: parent.home_id, path: parent.path, item_id: top.id}

Source:

This solution is not without merit, however, given my familiarity with row_number I opted for that solution instead.

Unless there is a considerable performance difference between the two in favor of lateral joins, I will keep the previous solution.

For more info, here is the original source where I got these answers:

Also Liked

andrea

andrea

Will Ecto be able to support NoSQL databases in the future?

Fl4m3Ph03n1x

Fl4m3Ph03n1x

While I am not an expert, I believe there is a package called mongodb_ecto:

So if you use MongoDB, this might be your thing.

Where Next?

Popular Backend topics Top

Kurisu
Hello, Please, let’s say I have a website with user authentication made with Elixir/Phoenix, and now want to add a forum to it (using a ...
New
New
sampu
I have a use case where a client is invoking a Rest endpoint via a load balancer, which in turn invokes a third party endpoint which is r...
New
AstonJ
If when trying to create (or recreate) your dev db with rails db:create you are getting: PG::ConnectionBad: connection to server on soc...
New
Fl4m3Ph03n1x
Background I am moving towards defined data structures in my application, and I find that TypedStruct is quite useful. Questions Howeve...
New
sona11
If isReachable throws an IOException in Java, what is the right step to do and why? The application, I believe, should halt the process ...
New
harwind
I received this error for a binary search programme in C, despite the fact that it requested for inputs and produced the right output. Th...
/c
New
Fl4m3Ph03n1x
Background I have a release file inside a tarball. However I want the final release to have some additional files and to move things aro...
New
Fl4m3Ph03n1x
Background I have an umbrella project, where I run mix test from the root. In one of the apps, I am mocking the File module using the Mo...
New
apoorv-2204
Anyone know how to get in golang? I am from elixir background?.
New

Other popular topics Top

Exadra37
Please tell us what is your preferred monitor setup for programming(not gaming) and why you have chosen it. Does your monitor have eye p...
New
dasdom
No chair. I have a standing desk. This post was split into a dedicated thread from our thread about chairs :slight_smile:
New
AstonJ
We have a thread about the keyboards we have, but what about nice keyboards we come across that we want? If you have seen any that look n...
New
AstonJ
This looks like a stunning keycap set :orange_heart: A LEGENDARY KEYBOARD LIVES ON When you bought an Apple Macintosh computer in the e...
New
dimitarvp
Small essay with thoughts on macOS vs. Linux: I know @Exadra37 is just waiting around the corner to scream at me “I TOLD YOU SO!!!” but I...
New
PragmaticBookshelf
Create efficient, elegant software tests in pytest, Python's most powerful testing framework. Brian Okken @brianokken Edited by Kat...
New
New
AstonJ
If you’re getting errors like this: psql: error: connection to server on socket “/tmp/.s.PGSQL.5432” failed: No such file or directory ...
New
PragmaticBookshelf
Get the comprehensive, insider information you need for Rails 8 with the new edition of this award-winning classic. Sam Ruby @rubys ...
New
PragmaticBookshelf
A concise guide to MySQL 9 database administration, covering fundamental concepts, techniques, and best practices. Neil Smyth MySQL...
New