Big Data on Your Laptop? (cont.)
Here's the code for the Big Data on Your Laptop experiment. Please comment if you have any suggestions to improve the accuracy of this measurement. Thanks!
Code:
import pandas as pd
import numpy as np
import time
from datetime import datetime, timedelta
import random

def rand_str(length=10):
    """Return a random string of upper- and lowercase ASCII letters."""
    letters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    return ''.join(random.choice(letters) for _ in range(length))

def rand_datetime(start, end):
    """Return a random datetime between start and end, at second resolution."""
    return start + timedelta(seconds=random.randint(0, int((end - start).total_seconds())))

def gen_dates(start, end, spec_date, num, frac=0.001):
    """Return num datetimes; a fraction frac falls inside the hour starting
    at spec_date, the rest are uniform over [start, end]."""
    spec_dates = [spec_date + timedelta(minutes=random.randint(0, 59),
                                        seconds=random.randint(0, 59))
                  for _ in range(int(num * frac))]
    other_dates = [rand_datetime(start, end) for _ in range(num - len(spec_dates))]
    return spec_dates + other_dates
# Experiment parameters: 10M rows, with 3 float, 5 string, and 2 datetime columns.
num_rows, num_float, num_string, num_date = 10_000_000, 3, 5, 2
start_date, end_date, spec_date_hour = datetime(2020, 1, 1), datetime(2023, 12, 31), datetime(2023, 12, 31, 13)

total_start = time.time()

# Build DataFrame 1 and time its construction.
start = time.time()
data1 = {f'float{i}': np.random.rand(num_rows) for i in range(num_float)}
data1.update({f'str{i}': [rand_str() for _ in range(num_rows)] for i in range(num_string)})
data1.update({f'date{i}': gen_dates(start_date, end_date, spec_date_hour, num_rows) for i in range(num_date)})
df1 = pd.DataFrame(data1)
time1 = time.time() - start
print(f"DataFrame 1: {time1:.2f} s")
print(f"Total rows in DataFrame 1: {df1.shape[0]}")

# Build DataFrame 2 the same way.
start = time.time()
data2 = {f'float{i}': np.random.rand(num_rows) for i in range(num_float)}
data2.update({f'str{i}': [rand_str() for _ in range(num_rows)] for i in range(num_string)})
data2.update({f'date{i}': gen_dates(start_date, end_date, spec_date_hour, num_rows) for i in range(num_date)})
df2 = pd.DataFrame(data2)
time2 = time.time() - start
print(f"DataFrame 2: {time2:.2f} s")
print(f"Total rows in DataFrame 2: {df2.shape[0]}")

# Count the rows each DataFrame has inside the special hour.
start_hour = spec_date_hour.replace(minute=0, second=0, microsecond=0)
end_hour = spec_date_hour.replace(minute=59, second=59, microsecond=999999)
df1_filtered = df1[(df1['date0'] >= start_hour) & (df1['date0'] <= end_hour)]
df2_filtered = df2[(df2['date0'] >= start_hour) & (df2['date0'] <= end_hour)]
print(f"DF1 rows in hour {spec_date_hour}: {df1_filtered.shape[0]}")
print(f"DF2 rows in hour {spec_date_hour}: {df2_filtered.shape[0]}")

# Join on date0, then keep only the joined rows inside the special hour.
start = time.time()
joined_df = (df1.merge(df2, on='date0', suffixes=('_df1', '_df2'))
                .query("date0 >= @start_hour & date0 <= @end_hour"))
join_time = time.time() - start
print(f"Join time: {join_time:.2f} s")
print(f"Joined rows: {joined_df.shape[0]}")

# Latency = total wall-clock time minus the explicitly timed stages.
total_exec_time = time1 + time2 + join_time
total_resp_time = time.time() - total_start
latency = total_resp_time - total_exec_time
print(f"Total response time: {total_resp_time:.2f} s")
print(f"Total execution time: {total_exec_time:.2f} s")
print(f"Latency: {latency:.2f} s")