Return array from Linq SQL query on 80 million row table using EFCore

c# entity-framework-core linq sql-server

Question

I'm querying a SQL Server database using LINQ and Entity Framework Core in a Razor Pages project to generate a residency plot.

I'm struggling to optimise this query despite many attempts and iterations: it is slow and often times out. I only need the final array of Count() values that make up each square of the residency plot; I'm not interested in the raw data.

The data come from a table with ~80 million rows. I've found solutions on SO that might work with fewer entries but aren't suitable in this case (generally by searching for LINQ, group, join). I think the problem is getting the combination of filters, groups and joins followed by a count to run server-side, without first downloading the raw data.

Reviewing the SQL command in SSMS (pulled from LINQPad), it is very poorly optimised. I can post it if that would be useful, but it's 236 lines long and made up of repeated sections.
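As an aside, one way to see the SQL that EF Core itself sends to the server (rather than relying on LINQPad) is to attach a console logger when registering the context. This is a sketch only: the `TestDbContext` name and connection string are placeholders, and it uses the `LoggerFactory.Create` API from Microsoft.Extensions.Logging 3.x, which is built slightly differently on older versions:

```csharp
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;

public class Startup
{
    // Logs only executed database commands (the generated SQL) to the console.
    private static readonly ILoggerFactory SqlLoggerFactory =
        LoggerFactory.Create(builder => builder
            .AddFilter(DbLoggerCategory.Database.Command.Name, LogLevel.Information)
            .AddConsole());

    public void ConfigureServices(IServiceCollection services)
    {
        services.AddDbContext<TestDbContext>(options => options
            .UseLoggerFactory(SqlLoggerFactory)
            .UseSqlServer("<connection string>")); // placeholder
    }
}
```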

The LINQ I've cobbled together performs the required operation in the 4 steps outlined below.

Step 1 (rows between a certain time, with a certain LocationTypeId, and a channelId = engSpeed):

var speedRows = context.TestData
.Where(a => a.Time >= start
&& a.Time < end
&& a.LocationTypeId == 3
&& a.LiveDataChannelId == 7)
.Select(s => new
{
    s.Time,
    s.ChannelValue
})
.Distinct();

Step 2 (rows with a channelId = torque):

var torqueRows = context.TestData
.Where(a => a.LocationTypeId == 3
&& a.LiveDataChannelId == 8)
.Select(s => new
{
    s.Time,
    s.ChannelValue
})
.Distinct();

Step 3 (join the speed and torque rows from Step 1 and Step 2 on Time):

var joinedRows = speedRows.Join(torqueRows, arg => arg.Time, arg => arg.Time,
    (speed, torque) => new
    {
        Id = speed.Time,
        Speed = Convert.ToDouble(speed.ChannelValue),
        Torque = Convert.ToInt16(torque.ChannelValue)
    });

Step 4 (create the dynamic groupings using the joined table from Step 3):

var response = (from a in joinedRows
            group a by (a.Torque / 100) into torqueGroup
            orderby torqueGroup.Key
            select new
            {
                TorqueBracket = $"{100 * torqueGroup.Key} <> {100 + (100 * torqueGroup.Key)}",
                TorqueMin = 100 * torqueGroup.Key,
                TorqueMax = 100 + (100 * torqueGroup.Key),
                Speeds = (from d in torqueGroup
                            group d by (Math.Floor((d.Speed) / 500)) into speedGroup
                            orderby speedGroup.Key
                            select new
                            {
                                SpeedBracket = $"{500 * speedGroup.Key} <> {500 + (500 * speedGroup.Key)}",
                                SpeedMin = 500 * (int)speedGroup.Key,
                                SpeedMax = 500 + (500 * (int)speedGroup.Key),
                                Minutes = speedGroup.Count()
                            })
            }).ToList();
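To make the bucketing above concrete (with hypothetical values, not data from the real table), the same grouping arithmetic can be exercised in memory: torque goes into 100-wide brackets via integer division, speed into 500-wide brackets via Math.Floor.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BucketDemo
{
    // Buckets (Speed, Torque) pairs exactly as the query does:
    // torque into 100-wide brackets, speed into 500-wide brackets.
    public static List<(string TorqueBracket, string SpeedBracket, int Minutes)> ComputeCells(
        IEnumerable<(double Speed, short Torque)> rows)
    {
        return rows
            .GroupBy(r => (TorqueKey: r.Torque / 100, SpeedKey: (int)Math.Floor(r.Speed / 500)))
            .OrderBy(g => g.Key.TorqueKey).ThenBy(g => g.Key.SpeedKey)
            .Select(g => (
                TorqueBracket: $"{100 * g.Key.TorqueKey} <> {100 + 100 * g.Key.TorqueKey}",
                SpeedBracket: $"{500 * g.Key.SpeedKey} <> {500 + 500 * g.Key.SpeedKey}",
                Minutes: g.Count()))
            .ToList();
    }

    public static void Main()
    {
        // Two rows share the (200 <> 300, 1000 <> 1500) cell; one row falls in (0 <> 100, 0 <> 500).
        var rows = new (double, short)[] { (1250.0, 250), (1400.0, 260), (300.0, 50) };
        foreach (var c in ComputeCells(rows))
            Console.WriteLine($"{c.TorqueBracket} | {c.SpeedBracket} | {c.Minutes}");
        // 0 <> 100 | 0 <> 500 | 1
        // 200 <> 300 | 1000 <> 1500 | 2
    }
}
```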

I could be missing something obvious, but after many attempts this is the best I've got.

The TestData class:

public partial class TestData {
    public int LiveDataId { get; set; }
    public DateTime? Time { get; set; }
    public int? LocationTypeId { get; set; }
    public int? TestNo { get; set; }
    public int? LogNo { get; set; }
    public int? LiveDataChannelId { get; set; }
    public decimal? ChannelValue { get; set; }
    public virtual LiveDataChannelNames LiveDataChannel { get; set; }
    public virtual LocationType LocationType { get; set; }
}
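For completeness, a minimal DbContext exposing the context.TestData set used above might look like this (a sketch only; the context name, table name and key are assumptions not shown in the question):

```csharp
using Microsoft.EntityFrameworkCore;

public class TestDbContext : DbContext
{
    public TestDbContext(DbContextOptions<TestDbContext> options) : base(options) { }

    public DbSet<TestData> TestData { get; set; }

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        modelBuilder.Entity<TestData>(entity =>
        {
            entity.ToTable("TestData");
            // Assumed primary key; the real table may use a different or composite key.
            entity.HasKey(e => e.LiveDataId);
        });
    }
}
```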

Any help or pointers would be appreciated.

Thank you.

2/3/2020 6:16:03 PM

Accepted Answer

Although the answers posted here helped me understand the intricacies of the problem (thank you to all who posted), they did not work with Entity Framework Core 2.2.6. I've managed to get reasonable and stable performance with the C# code below.

Crucially, in Steps 1 and 2 the .ToList() stops timeouts on longer queries by (I assume) splitting up the enumeration of results, possibly with a small time penalty. Also, the (double) and (short) conversions are translated to native server-side casts, unlike Convert.ToDouble and Convert.ToInt16.

Step 1 (rows between a certain time, with a certain LocationTypeId, and a channelId = engSpeed):

var speedRows = context.TestData
.Where(a => a.Time >= start
&& a.Time < end
&& a.LocationTypeId == 3
&& a.LiveDataChannelId == 7)
.Select(s => new
{
    s.Time,
    ChannelValue = (double)s.ChannelValue
})
.Distinct().ToList();

Step 2 (rows with a channelId = torque):

var torqueRows = context.TestData
.Where(a => a.LocationTypeId == 3
&& a.LiveDataChannelId == 8)
.Select(s => new
{
    s.Time,
    ChannelValue = (short)s.ChannelValue
})
.Distinct().ToList();

Step 3 (join the speed and torque rows from Step 1 and Step 2 on Time; after the ToList() calls above, this join runs in memory with LINQ to Objects):

var joinedRows = speedRows.Join(torqueRows, arg => arg.Time, arg => arg.Time,
    (speed, torque) => new
    {
        Id = speed.Time,
        Speed = speed.ChannelValue,
        Torque = torque.ChannelValue
    });

Step 4 (group the rows into keyed groups):

var groupedData = joinedRows
    .GroupBy(arg => new { TorqueGroupKey = (arg.Torque / 100), SpeedGroupKey = Math.Floor((arg.Speed) / 500) })
    .Select(g => new
    {
        g.Key.TorqueGroupKey,
        g.Key.SpeedGroupKey,
        Minutes = g.Count()
    });

Step 5 (create the dynamic groupings using groupedData from Step 4):

var response = (from a in groupedData.AsEnumerable()
            group a by a.TorqueGroupKey into torqueGroup
            orderby torqueGroup.Key
            select new ResidencySqlResult
            {
                TorqueBracket = $"{100 * torqueGroup.Key} <> {100 + (100 * torqueGroup.Key)}",
                TorqueMin = 100 * torqueGroup.Key,
                TorqueMax = 100 + (100 * torqueGroup.Key),
                Speeds = (from d in torqueGroup
                          orderby d.SpeedGroupKey
                          select new Speeds
                          {
                              SpeedBracket = $"{500 * d.SpeedGroupKey} <> {500 + (500 * d.SpeedGroupKey)}",
                              SpeedMin = 500 * (int)d.SpeedGroupKey,
                              SpeedMax = 500 + (500 * (int)d.SpeedGroupKey),
                              Minutes = d.Minutes
                          })
            }).ToList();
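The ResidencySqlResult and Speeds result types are referenced but never shown in the answer; a minimal shape that satisfies the projection above might be (names from the answer, member types assumed from the query):

```csharp
using System.Collections.Generic;

public class Speeds
{
    public string SpeedBracket { get; set; }
    public int SpeedMin { get; set; }
    public int SpeedMax { get; set; }
    public int Minutes { get; set; }
}

public class ResidencySqlResult
{
    public string TorqueBracket { get; set; }
    public int TorqueMin { get; set; }
    public int TorqueMax { get; set; }
    public IEnumerable<Speeds> Speeds { get; set; }
}
```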

Thank you again to all those that helped out.

2/10/2020 11:38:48 AM

Popular Answer

I doubt the actual generated SQL command is that big - you're probably looking at the SQL command generated by EF6.

The SQL generated by EF Core is not that big, but the problem is that the Speeds = ... part of the grouping cannot be translated to SQL, so it is evaluated client-side after retrieving all the data from the previous parts of the query.

What you can do is create an intermediate query which retrieves only the data needed (the 2 grouping keys plus a count) and then do the rest client-side.

First, make sure that the subqueries from Steps 1, 2 and 3 are translatable to SQL. Convert.ToDouble and Convert.ToInt16 are not translatable, so replace them with casts:

Speed = (double)speed.ChannelValue,
Torque = (short)torque.ChannelValue

Then split Step 4 into two parts. The server part:

var groupedData = joinedRows
    .GroupBy(arg => new { TorqueGroupKey = arg.Torque / 100, SpeedGroupKey = Math.Floor((arg.Speed) / 500) })
    .Select(g => new
    {
        g.Key.TorqueGroupKey,
        g.Key.SpeedGroupKey,
        Minutes = g.Count()
    });

and the client part:

var response = (from a in groupedData.AsEnumerable() // <-- switch to client evaluation
                group a by a.TorqueGroupKey into torqueGroup
                orderby torqueGroup.Key
                select new
                {
                    TorqueBracket = $"{100 * torqueGroup.Key} <> {100 + (100 * torqueGroup.Key)}",
                    TorqueMin = 100 * torqueGroup.Key,
                    TorqueMax = 100 + (100 * torqueGroup.Key),
                    Speeds = (from d in torqueGroup
                              orderby d.SpeedGroupKey
                              select new
                              {
                                  SpeedBracket = $"{500 * d.SpeedGroupKey} <> {500 + (500 * d.SpeedGroupKey)}",
                                  SpeedMin = 500 * (int)d.SpeedGroupKey,
                                  SpeedMax = 500 + (500 * (int)d.SpeedGroupKey),
                                  Minutes = d.Minutes
                              })
                }).ToList();

Note that in EF Core 3.0+ you'll be forced to do something like this anyway, because implicit client evaluation has been removed.

The generated SQL query now should be something like this:

  SELECT [t0].[ChannelValue] / 100 AS [TorqueGroupKey], FLOOR([t].[ChannelValue] / 500.0E0) AS [SpeedGroupKey], COUNT(*) AS [Minutes]
  FROM (
      SELECT DISTINCT [a].[Time], [a].[ChannelValue]
      FROM [TestData] AS [a]
      WHERE ((([a].[Time] >= @__start_0) AND ([a].[Time] < @__end_1)) AND ([a].[LocationTypeId] = 3)) AND ([a].[LiveDataChannelId] = 7)
  ) AS [t]
  INNER JOIN (
      SELECT DISTINCT [a0].[Time], [a0].[ChannelValue]
      FROM [TestData] AS [a0]
      WHERE ([a0].[LocationTypeId] = 3) AND ([a0].[LiveDataChannelId] = 8)
  ) AS [t0] ON [t].[Time] = [t0].[Time]
  GROUP BY [t0].[ChannelValue] / 100, FLOOR([t].[ChannelValue] / 500.0E0)


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow