Introduction to data.table
Data analysis using data.table
Data
Introduction

1. Basics
  a) What is data.table?
     Note that:
  b) General form - in what way is a data.table enhanced?
     The way to read it (out loud) is:
  c) Subset rows in i
     -- Get all the flights with "JFK" as the origin airport in the month of June.
     -- Get the first two rows from flights.
     -- Sort flights first by column origin in ascending order, and then by dest in descending order:
     order() is internally optimised
  d) Select column(s) in j
     -- Select arr_delay column, but return it as a vector.
     -- Select arr_delay column, but return as a data.table instead.
     Tip:
     -- Select both arr_delay and dep_delay columns.
     -- Select both arr_delay and dep_delay columns and rename them to delay_arr and delay_dep.
  e) Compute or do in j
     -- How many trips have had total delay < 0?
     What's happening here?
  f) Subset in i and do in j
     -- Calculate the average arrival and departure delay for all flights with "JFK" as the origin airport in the month of June.
     -- How many trips have been made in 2014 from "JFK" airport in the month of June?
  g) Handle non-existing elements in i
     -- What happens when querying for non-existing elements?
     Special symbol .N:
  h) Great! But how can I refer to columns by names in j (like in a data.frame)?
     -- Select both arr_delay and dep_delay columns the data.frame way.
     -- Select columns named in a variable using the .. prefix
     -- Select columns named in a variable using with = FALSE

2. Aggregations
  a) Grouping using by
     -- How can we get the number of trips corresponding to each origin airport?
     -- How can we calculate the number of trips for each origin airport for carrier code "AA"?
     -- How can we get the total number of trips for each origin, dest pair for carrier code "AA"?
     -- How can we get the average arrival and departure delay for each orig, dest pair for each month for carrier code "AA"?
  b) Sorted by: keyby
     -- So how can we directly order by all the grouping variables?
  c) Chaining
     -- How can we order ans using the columns origin in ascending order, and dest in descending order?
  d) Expressions in by
     -- Can by accept expressions as well or does it just take columns?
  e) Multiple columns in j - .SD
     -- Do we have to compute mean() for each column individually?
     Special symbol .SD:
     -- How can we specify just the columns we would like to compute the mean() on?
     .SDcols
  f) Subset .SD for each group:
     -- How can we return the first two rows for each month?
  g) Why keep j so flexible?
     -- How can we concatenate columns a and b for each group in ID?
     -- What if we would like to have all the values of column a and b concatenated, but returned as a list column?

Summary
  Using i:
  Using j:
  Using by:
  And remember the tip:
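
The outline above is organised around the three arguments of a data.table query: i (which rows), j (what to compute) and by (grouped by what). As a rough illustration of how a few of the listed questions might be answered, here is a minimal sketch in R. It does not use the real flights data: the six-row `flights` table below is a hypothetical stand-in, and the column names `month` and `carrier` are assumptions (the outline names only origin, dest, arr_delay and dep_delay explicitly).

```r
library(data.table)

# Hypothetical toy table standing in for the flights dataset.
# Column names `month` and `carrier` are assumed for illustration.
flights <- data.table(
  origin    = c("JFK", "JFK", "LGA", "JFK", "EWR", "LGA"),
  dest      = c("LAX", "SFO", "ORD", "LAX", "MIA", "ORD"),
  month     = c(6L, 6L, 6L, 7L, 6L, 7L),
  carrier   = c("AA", "AA", "AA", "DL", "UA", "AA"),
  arr_delay = c(10, -5, 20, 0, 15, -10),
  dep_delay = c(5, -2, 10, 3, 8, -4)
)

# Subset rows in i: flights from "JFK" in June.
jfk_june <- flights[origin == "JFK" & month == 6L]

# Sort by origin ascending, then dest descending.
sorted <- flights[order(origin, -dest)]

# Select columns in j and rename them; .() is an alias for list().
delays <- flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]

# Compute in j: number of trips with total delay < 0.
n_early <- flights[, sum((arr_delay + dep_delay) < 0)]

# Subset in i and do in j: mean delays for JFK flights in June.
jfk_means <- flights[origin == "JFK" & month == 6L,
                     .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]

# Group with by: trips per origin for carrier "AA" (.N is the row count).
aa_counts <- flights[carrier == "AA", .N, by = origin]

# keyby: same idea, but the result is sorted by the grouping columns.
aa_pairs <- flights[carrier == "AA", .N, keyby = .(origin, dest)]

# .SD with .SDcols: mean of selected columns for each group.
cols <- c("arr_delay", "dep_delay")
group_means <- flights[, lapply(.SD, mean), by = origin, .SDcols = cols]

# Subset .SD for each group: first two rows per month.
first_two <- flights[, head(.SD, 2), by = month]
```

Each statement maps onto one outline entry, so the same pattern (filter in i, compute in j, group in by) should carry over to the real dataset once its actual column names are substituted.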
