AG2PI Workshop #25 - April 10-12, 2024


Scientific Computing & Data Analytics: A Comprehensive Toolkit for Research

April 10-12, 2024 @ 11:00 AM - 01:00 PM (US Central Time)

Purpose

Hands-on training in efficient scientific computing techniques for both small and large data sets.

Registration

(Virtual Zoom Meeting)

Register for the virtual event by clicking the link below. Upon registration, you will receive a confirmation email with information about joining the meeting.

Workshop Registration

Workshop Resources

Click the buttons below to access the resources for this workshop.

Recordings
Watch Day 1 Recording
Watch Day 2 Recording
Watch Day 3 Recording
Resources: Code and Google Colab Notebooks

Chat Questions

Questions for the speakers that were posted in the chat can be viewed by clicking the button below.

See Chat Questions

To fully harness the potential of extensive data sets, it is essential to process the raw data using methods such as information extraction, data mining, and knowledge discovery. This workshop series aims to equip you with a comprehensive computational toolkit for your research. This toolkit will enable you to effectively manage and analyze the increasing volumes of data in your field, thereby enhancing your research output.

The workshop series will cover:
  1. Fundamental concepts of Python, Jupyter notebooks, and GitHub
  2. Overview of machine learning and interactive visualization
  3. Real-world applications in the field of plant phenotyping

By the end of the workshop series, you'll be able to:
  1. Understand the fundamentals of scientific computing, including data preprocessing, statistical analysis, machine learning, and data visualization
  2. Incorporate a variety of scientific computing techniques into your research workflow
  3. Utilize established software to identify patterns within data sets
  4. Perform analyses with common machine learning tools, including neural networks and dimensionality reduction methods


About Presenters

Emmanuel Gonzalez

Emmanuel Gonzalez is a PhD candidate in Dr. Duke Pauli's lab at the University of Arizona whose work focuses on leveraging plant phenomics, data science, and machine learning to investigate how crops respond to both abiotic and biotic stress.


Jeffrey Demieville

Jeffrey Demieville is an interdisciplinary R&D Engineer at the University of Arizona whose work focuses on applying biological, agricultural, and systems engineering practices to the field of phenomics.


Brenda Huppenthal

Brenda Huppenthal is a computer science graduate student at the University of Arizona whose work focuses on using computer vision and specifically deep learning approaches to obtain underlying structural information from point clouds.


Emily Cawley

Emily Cawley is an undergraduate Computer Science student at the University of Arizona. As an undergraduate researcher in the Pauli lab, she works on predicting transformations of 3D data with neural networks.


Aditya Kumar

Aditya Kumar is an undergraduate researcher in the Pauli lab. He specializes in software development and data visualization and is currently focusing on refining object detection models for sorghum panicle detection.


Bella Salter

Bella Salter is an undergraduate researcher in the Pauli lab whose work focuses on predicting late-season lettuce growth from early metrics of success, using machine learning techniques such as long short-term memory networks.


Chat Questions

Session 1


In Google Colab, uploading images for deep learning takes a lot of time and can hang. Is this true, or is it easy to upload millions of pictures for deep learning?

That is correct: data transfer can be time consuming and is often a bottleneck. Google Colab is limited by the amount of storage space on your Google Drive, which usually means you will need to go beyond Colab notebooks once you outgrow that space. People often move to cloud computing or high-performance computers with larger storage. You can read more about Google's quotas and limits here.


Is the Pro version of Google Colab required to run the code?

(Answer #1) No, you can run it using a free account. Just make sure you are logged in to your Google account to run the code.

(Answer #2) No, not when you are running code that only requires a CPU. If your code requires a GPU, Google does provide limited access to GPUs for free, but it depends on the load on the system at the time.


What does \n signify?

\n is a control character that signifies a new line.
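As a quick illustration (the example string here is made up):

```python
# "\n" inside a string marks the end of a line
greeting = "Hello\nworld"
print(greeting)          # prints on two lines

# it can also be used to split text into its lines
lines = greeting.split("\n")
print(lines)             # ['Hello', 'world']
```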


Can someone tell what the bq_stripped function does?

bq_stripped is a variable that holds a string. Under the 'Processing' heading, you can see that we set bq_stripped = block_quote.strip(), so it contains the original block quote with leading and trailing whitespace stripped. You can read more about the strip function here.
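A minimal sketch of what that line does (the variable names mirror the notebook, but the quote text here is invented):

```python
block_quote = "   To be, or not to be   "

# strip() removes leading and trailing whitespace only;
# interior spaces are left untouched
bq_stripped = block_quote.strip()
print(repr(bq_stripped))    # 'To be, or not to be'

# strip() also accepts a set of characters to remove instead
print("xx1xx".strip("x"))   # 1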


Session 2


In linear regression, is the intercept set to one by default? Suppose we want to supply our own value of beta (the intercept); how can we do that in Python?

In scikit-learn, whether an intercept is estimated is controlled by the fit_intercept parameter of LinearRegression (it defaults to True). After fitting, the learned intercept is available in the model's intercept_ attribute.
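A minimal sketch with scikit-learn (the data here is made up for illustration; the trick of subtracting a fixed beta before fitting is one common workaround, not the only one):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0      # true slope 2, true intercept 1

# fit_intercept=True (the default) estimates the intercept from the data
model = LinearRegression(fit_intercept=True).fit(X, y)
print(model.coef_[0], model.intercept_)   # ~2.0, ~1.0

# fit_intercept=False forces the intercept to zero; to impose a fixed,
# nonzero intercept beta, subtract it from y before fitting
beta = 1.0
model0 = LinearRegression(fit_intercept=False).fit(X, y - beta)
print(model0.coef_[0])                    # ~2.0
```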


When compressing data, do we need to downsample or apply a log-log transform? Can you please explain?

For 3D processing, point clouds are downsampled using voxel downsampling. This allows us to more easily visualize/analyze the data. You can read more about voxel downsampling here.
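In practice a library routine such as Open3D's voxel_down_sample is used; as a rough pure-Python sketch of the idea (bin points into cubic voxels, keep one averaged point per voxel — the cloud below is made up):

```python
from collections import defaultdict

def voxel_downsample(points, voxel_size):
    """Average all points that fall into the same cubic voxel."""
    buckets = defaultdict(list)
    for x, y, z in points:
        # integer voxel index along each axis
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        buckets[key].append((x, y, z))
    downsampled = []
    for pts in buckets.values():
        n = len(pts)
        downsampled.append((sum(p[0] for p in pts) / n,
                            sum(p[1] for p in pts) / n,
                            sum(p[2] for p in pts) / n))
    return downsampled

cloud = [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (5.0, 5.0, 5.0)]
print(voxel_downsample(cloud, voxel_size=1.0))
# two voxels survive: one averaged point near the origin, one at (5, 5, 5)
```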


Why do we need a Mapbox API key? Is it paid, and how expensive is it?

The Mapbox API key is free. However, there are some limitations on the number of data points that can be visualized. You can read more about Mapbox and Plotly Express here.

(Follow-up Question) Adding to that: can we use ArcGIS satellite mapping?

(Answer Follow-up #1) I am not sure if ArcGIS satellite mapping is supported. The maps used in Mapbox are provided by OpenStreetMap. You can read more about Mapbox here.

(Answer Follow-up #2) Here is additional documentation from Plotly; the relevant links within it illustrate when you need an access token: https://plotly.com/python/scattermapbox/.


Session 3


What topological traits are going to supplement existing traits in the phenomics domain?

We can use Euler characteristic curves to assess the shape signature along an axis, or use persistence diagram-based features such as amplitude and persistence entropy. These features provide insight into a plant's shape in addition to traditional traits such as height and volume.


Can we use architectures such as EfficientNetV2, ResNet, etc., for training and testing? What is the difference between using those and the current code?

The code currently being explained splits the data into train, validation, and test sets. This is done randomly using the Python package scikit-learn. Once split, you can use model architectures like the EfficientNetV2 that you mentioned.
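A minimal sketch of such a split using scikit-learn's train_test_split (the split sizes and random_state here are illustrative assumptions, not the notebook's actual settings):

```python
from sklearn.model_selection import train_test_split

data = list(range(10))    # stand-in for image paths or feature rows

# first carve off 20% of the data as the test set
train_val, test = train_test_split(data, test_size=0.2, random_state=42)

# then split the remainder 75/25 into train and validation sets
train, val = train_test_split(train_val, test_size=0.25, random_state=42)

print(len(train), len(val), len(test))   # 6 2 2
```

Any model architecture can then be trained on `train`, tuned on `val`, and evaluated once on `test`.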

(Follow-up Question) Can 'merge_geotiff' do image stitching or orthomosaicking?

(Answer Follow-up) Here is the documentation for the rasterio.merge module used in Brenda's notebook.